Similarity Measures And New Clustering Methods For Categorical Sequences

Posted on: 2016-11-28   Degree: Master   Type: Thesis
Country: China   Candidate: H Zhang   Full Text: PDF
GTID: 2308330473956956   Subject: Computer application technology
Abstract/Summary:
Clustering is an unsupervised machine learning method that is widely used in machine vision, information retrieval, pattern extraction, and many other areas of data mining. The volume of categorical sequence data in scientific research and commercial applications keeps growing; common examples include DNA and protein sequences in bioinformatics and speech sequences in speech recognition. Clustering of categorical sequences has therefore become an active research topic.

Categorical sequences are non-numerical, differ in length, and exhibit complex dependencies between symbols, so traditional similarity measures designed for numerical data cannot be applied to them directly. This makes categorical sequence clustering a challenging task. Besides an effective similarity measure, it also requires an effective clustering algorithm.

This thesis analyzes the current mainstream similarity measures for categorical sequences and the problems they leave open. It then proposes a normalized similarity measure and a sequence similarity measure based on subsequence similarity that incorporates global information. To address the problems of single-link agglomerative hierarchical clustering, it further proposes a clustering algorithm based on constructing and partitioning a connected acyclic graph. The work has both theoretical significance and practical application value.
The main work and contributions of this thesis are as follows:
(1) A normalized similarity measure is proposed that combines sequence alignment with a normalization factor. The alignment captures both local and global information about the sequences, and the normalization factor reduces the bias that sequence length introduces into the similarity.
(2) Existing sequence similarity measures based on subsequence similarity lack global information. To supply it, the entropy of the symbols in a sequence is introduced, and a subsequence similarity measure based on symbol entropy is proposed. Building on this, a sequence similarity measure based on dynamic programming is proposed.
(3) To address the disadvantages of single-link agglomerative hierarchical clustering, which is widely used for categorical sequences, a clustering algorithm based on the construction and partition of a connected acyclic graph is proposed. Combining it with each of the two similarity measures above yields two new categorical sequence clustering algorithms, both of which effectively improve clustering accuracy.
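The abstract does not give the thesis's formulas, so the following Python is only an illustrative sketch of the general ideas behind contributions (1) and (3), not the author's method. It assumes a Needleman-Wunsch global alignment score divided by the longer sequence length as the normalization, and a maximum spanning tree as the connected acyclic graph. The function names, the scoring parameters, and the partition rule (cutting the k-1 weakest tree edges) are all hypothetical. With this simple cut rule the result coincides with single-link clustering, so the thesis presumably partitions the graph by a different criterion that the abstract does not describe.

```python
from itertools import combinations


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment (Needleman-Wunsch) score via dynamic programming.

    The scoring parameters are assumptions, not values from the thesis.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def normalized_similarity(a, b):
    """Assumed normalization: score / longer length, clamped to [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, alignment_score(a, b) / longest)


def spanning_tree_clusters(seqs, k):
    """Build a maximum spanning tree over pairwise similarities,
    then cut its k-1 weakest edges (placeholder partition rule).

    Returns one cluster label per input sequence.
    """
    n = len(seqs)
    edges = sorted(
        ((normalized_similarity(seqs[i], seqs[j]), i, j)
         for i, j in combinations(range(n), 2)),
        reverse=True,
    )
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal's algorithm: edges are visited from most to least similar,
    # so the tree edges are collected in descending order of similarity.
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((w, i, j))

    # Partition: keep only the n-k strongest tree edges.
    kept = tree[: max(n - k, 0)]
    parent = list(range(n))
    for _, i, j in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
    print(spanning_tree_clusters(seqs, k=2))
```

Dividing by the longer length keeps the similarity of two sequences from growing simply because they are long, which is the length bias that contribution (1) targets. Other normalization factors would fit the same structure.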
Keywords/Search Tags: categorical sequence, similarity, normalization, entropy, hierarchical clustering