Research On Non-parameterized Clustering Algorithm And Its Application In Text Clustering

Posted on:2021-04-24

Degree:Master

Type:Thesis

Country:China

Candidate:J S Chen

Full Text:PDF

GTID:2428330647958926

Subject:Management Science and Engineering

Abstract/Summary:

With the development of data mining technology, clustering has been used to analyze the structurally complex, multi-type data sets that appear in real life. As an unsupervised machine learning method, clustering requires neither pre-training nor manual annotation of the data set, so it lends itself to automatic processing and has been widely used in Internet information retrieval and other fields. Because text is at present the basic format of information, text clustering is among the clustering application scenarios most worthy of attention and research.

As a classic clustering algorithm, K-means is widely used because its principle is simple and easy to describe. However, this type of algorithm also has an obvious shortcoming: the accuracy and computational complexity of the clustering depend heavily on the initial parameters (such as the number of clusters and the cluster centers). In today's Web 2.0 era, the amount of information and text has increased dramatically. In many practical application scenarios, the data set is not only large in scale but also constantly changing, so some of the necessary initial parameters are difficult to predict and fix in advance.

In response to these problems, the main contributions and innovations of this thesis are summarized as follows:

1. A novel minimum spanning tree based non-parameterized clustering algorithm, named MNC (MST based Non-parameterized Clustering), is proposed. Non-parameterized clustering here means that only the data samples to be clustered are given as input; no other parameters are supplied. The key idea of the MNC algorithm is as follows. First, the data set is abstracted into a Weighted Complete Graph (WCG), where the points represent the data samples and the weighted edges represent the similarity relationships between the samples. Then the WCG is converted into a fully connected Minimum Spanning Tree (MST), and the pruning threshold is generated by a conventional clustering with k=2 over the one-dimensional space of the MST's edge weights. Finally, the MST is pruned and noise is filtered out, and the resulting connected components correspond to the output clusters.

2. The MNC algorithm is combined with classical text preprocessing techniques, such as Chinese word segmentation and the TF-IDF text representation model, and a complete text clustering library (PyTCL, Python based Text Clustering Library) has been developed in Python.

3. The validity of the MNC algorithm is verified on a visual two-dimensional random data set, a classic UCI data set, and a real text data set. By comparing the clustering results of different clustering algorithms, the efficiency of the MNC algorithm and the utility of the PyTCL library are verified.
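The pipeline described in contribution 1 can be illustrated with a minimal sketch. This is not the thesis's MNC implementation or the PyTCL API: the function names, the use of Euclidean distance as the edge weight, the midpoint rule for the threshold, and the `min_cluster_size` noise rule are all assumptions made for illustration only.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete graph; returns (weight, i, j) edges.
    Assumption: edge weight = Euclidean distance between samples."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def two_means_threshold(weights, iters=100):
    """1-D k-means with k=2 on the edge weights.
    Assumption: the threshold is the midpoint between the two centers."""
    lo, hi = min(weights), max(weights)
    if lo == hi:
        return hi  # all weights equal, so nothing will be pruned
    for _ in range(iters):
        mid = (lo + hi) / 2
        low = [w for w in weights if w <= mid]
        high = [w for w in weights if w > mid]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def mnc_sketch(points, min_cluster_size=2):
    """Prune MST edges above the threshold and label connected components.
    Hypothetical noise rule: components smaller than min_cluster_size get -1."""
    n = len(points)
    root = list(range(n))  # union-find parent array

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    edges = mst_edges(points)
    if edges:
        threshold = two_means_threshold([w for w, _, _ in edges])
        for w, i, j in edges:
            if w <= threshold:       # keep short edges, drop long ones
                root[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    labels = [-1] * n
    next_id = 0
    for members in groups.values():
        if len(members) >= min_cluster_size:
            for i in members:
                labels[i] = next_id
            next_id += 1
    return labels


two_blobs = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11)]
print(mnc_sketch(two_blobs))  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

In this sketch the only long MST edge is the bridge between the two groups, so the k=2 split of the weights isolates it and pruning yields two components. Note that `min_cluster_size` is itself a parameter; how the thesis filters noise without one is not stated in the abstract.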

Keywords/Search Tags:

non-parameterized clustering, K-means algorithm, minimum spanning tree, one-dimensional weight space, text clustering

Related items

1	Research And Application Of Web Chinese Text Clustering Algorithm Based On Minimum Spanning Tree
2	Research On Clustering Algorithm Based On Minimum Spanning Tree
3	K-means Clustering Method Improved Based-on Minimum Cost Spanning Tree And Its Applications In Seismic Data
4	Research On Clustering Algorithm Based On Minimum Spanning Tree
5	Research Of Clustering Algorithms Based On Minimum Spanning Tree
6	The Research On Dynamic And Abstract Clustering Method Of High Dimensional Sparse Data
7	The Three-Dimensional Index Structure Of R~*-tree Based On The Minimum Bounding Box And The Adaptive Clustering
8	Research And Application Of Graph Partition Clustering Algorithms Based On Cell-like P System
9	Research And Application Of Improved Minimum Spanning Tree Clustering Algorithm Based On Membrane Computing
10	Research On Clustering Algorithm Based On Tree Center Of Gravity And Cut Edge Constraints