Similarity Measures And New Clustering Methods For Categorical Sequences

Posted on:2016-11-28

Degree:Master

Type:Thesis

Country:China

Candidate:H Zhang

Full Text:PDF

GTID:2308330473956956

Subject:Computer application technology

Abstract/Summary:

Clustering is an unsupervised machine learning method that is widely used in machine vision, information retrieval, pattern extraction, and many other areas of data mining. The volume of categorical sequence data in scientific research and commercial applications keeps growing; common examples include DNA and protein sequences in bioinformatics and speech sequences in speech recognition. Clustering of categorical sequences has therefore become an active research topic.

Categorical sequences are non-numerical, differ in length, and contain complex dependencies between symbols, so traditional similarity measures designed for numerical data cannot be applied to them directly. This makes categorical sequence clustering a challenging task. Besides an effective similarity measure, it also requires an effective clustering algorithm.

This thesis analyzes the current mainstream similarity measures for categorical sequences and the problems they leave open. To address these problems, it proposes a normalized similarity measure and a sequence similarity measure based on global subsequence similarity. To address the problems of single-link agglomerative hierarchical clustering, it proposes a clustering algorithm based on constructing and partitioning a connected acyclic graph. The work has both theoretical significance and practical application value.
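The effect of sequence length on an alignment-based similarity can be illustrated with a minimal sketch. The abstract does not give the thesis's formula, so everything here is an assumption for illustration: the alignment score is taken to be the longest common subsequence (LCS) length, the normalization factor is the length of the longer sequence, and the names `lcs_length` and `normalized_similarity` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b.

    Stand-in for an alignment score (assumption): one point per matched
    symbol, no penalty for gaps. Computed by dynamic programming with
    a rolling row, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best alignment of the two prefixes by one match.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip a symbol in either a or b, whichever scores higher.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_similarity(a, b):
    """Alignment score divided by a length-based normalization factor.

    The factor max(len(a), len(b)) is an assumed choice; it maps the
    score into [0, 1] so that long sequences do not score higher merely
    because they offer more positions to match.
    """
    if not a and not b:
        return 1.0  # two empty sequences are treated as identical
    return lcs_length(a, b) / max(len(a), len(b))
```

With this choice, `normalized_similarity("ACGT", "ACGT")` is 1.0, while `normalized_similarity("ACGT", "ACGTACGT")` is 0.5 even though the raw LCS length is 4 in both cases; the unnormalized score would not distinguish the two pairs.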
The main work and contributions of this thesis are as follows:
(1) A normalized similarity measure is proposed that combines sequence alignment with a normalization factor. The alignment captures both local and global information about the sequences, and the normalization factor effectively reduces the bias that sequence length introduces into the similarity.
(2) Existing sequence similarity measures based on subsequence similarity lack global information. To supply it, the entropy of the symbols in a sequence is introduced as a carrier of global information, and an entropy-based subsequence similarity measure is proposed. Building on this measure, a sequence similarity measure based on dynamic programming is proposed.
(3) To overcome the disadvantages of the single-link agglomerative hierarchical clustering algorithm that is widely used for categorical sequence clustering, a clustering algorithm based on constructing and partitioning a connected acyclic graph is proposed. Combining this algorithm with each of the two similarity measures above yields two new categorical sequence clustering algorithms, both of which effectively improve clustering accuracy.
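The general idea of clustering by constructing and then partitioning a connected acyclic graph can be sketched as follows. The abstract does not say how the thesis builds or cuts the graph, so this sketch substitutes a well-known stand-in: a maximum-similarity spanning tree built with Kruskal's algorithm, partitioned by removing its k - 1 weakest edges. That particular cut rule yields the same clusters as single-link clustering, so the thesis's own partition step presumably differs. The names `spanning_tree_clusters` and `toy_similarity` are hypothetical.

```python
from itertools import combinations


def toy_similarity(a, b):
    """Jaccard overlap of the symbol sets; a placeholder measure only."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def spanning_tree_clusters(items, similarity, k):
    """Build a connected acyclic graph over items, then cut it into k parts."""
    n = len(items)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")

    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise edges, strongest similarity first.
    edges = sorted(
        ((similarity(items[i], items[j]), i, j)
         for i, j in combinations(range(n), 2)),
        key=lambda e: e[0],
        reverse=True,
    )

    # Construction: Kruskal's algorithm keeps an edge only if it joins two
    # separate components, so the result is connected and has no cycle.
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((w, i, j))

    # Partition: drop the k - 1 weakest tree edges (assumed cut rule).
    kept = tree[: n - k]

    # Recover the connected components that remain after the cut.
    parent[:] = range(n)
    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(items[idx])
    return list(groups.values())


clusters = spanning_tree_clusters(
    ["AAAA", "AAAT", "CCCC", "CCCG"], toy_similarity, k=2
)
```

Any pairwise similarity function can be passed in place of `toy_similarity`, which mirrors how the thesis combines one clustering algorithm with two different similarity measures.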

Keywords/Search Tags:

categorical sequence, similarity, normalization, entropy, hierarchical clustering

Related items

1	Studies On Clustering Algorithms For Categorical Data
2	Studies On Hierarchical Clustering For Categorical Data
3	ESCHCD: Entropy-based Algorithm For Subspace Clustering With High Dimensional Categorical Datasets
4	Study Of Algorithms For Clustering Categorical Data
5	Research On Hierarchical Clustering Algorithm Based On Silhouette
6	Research On Subspace Clustering Algorithm For Categorical Data
7	Automatic categorical data clustering and spatial data clustering by consecutive resolution refinement
8	Categorical Relation Graph Construction And Clustering Analysis For Categorical Data
9	Research On Mutual Information Hierarchical Clustering Based On Grassberger Entropy Estimator
10	Implementation And Application Of Global-Relationship Similarity Measure In Clustering