
Research On Non-parameterized Clustering Algorithm And Its Application In Text Clustering

Posted on: 2021-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: J S Chen
GTID: 2428330647958926
Subject: Management Science and Engineering
Abstract/Summary:
With the development of data mining technology, clustering has been used to analyze the structurally complex, multi-type data sets that appear in real life. As an unsupervised machine learning method, clustering requires neither pre-training nor manual annotation of the data set, so it lends itself to automatic processing and has been widely used in Internet information retrieval and other fields. Because text is currently the basic format of information, text clustering is among the clustering application scenarios most worthy of attention and research.

As a classic clustering algorithm, K-means is widely used because its principle is simple and easy to describe. However, this type of algorithm also has an obvious shortcoming: its clustering accuracy and computational complexity depend heavily on the initial parameters, such as the number of clusters and the initial cluster centers. In today's Web 2.0 era, the amount of information and text has increased dramatically. In many practical application scenarios, the data set is not only large in scale but also changes continuously. As a result, some necessary initial parameters are often difficult to predict or determine in advance.

In response to these problems, the main contributions and innovations of this thesis are summarized as follows:

1. A novel non-parameterized clustering algorithm based on the minimum spanning tree, named MNC (MST based Non-parameterized Clustering), is proposed. Non-parameterized clustering here means that only the data samples to be clustered are supplied as input, with no other parameters. The key idea of the MNC algorithm is as follows. First, the data set is abstracted into a Weighted Complete Graph (WCG), where the vertices represent the data samples and the weighted edges represent the similarity relationships between samples. Then the WCG is converted into a connected Minimum Spanning Tree (MST), and the pruning threshold is generated by applying conventional clustering with k=2 to the one-dimensional space of the MST's edge weights. Finally, the MST is pruned and noise is filtered out, and the resulting connected components correspond to the output clusters.

2. The MNC algorithm is combined with classical text preprocessing techniques, such as Chinese word segmentation and the TF-IDF text representation model, and a complete text clustering library, PyTCL (Python based Text Clustering Library), has been developed in Python.

3. The validity of the MNC algorithm is verified on a two-dimensional random data set that can be visualized, a classic UCI data set, and a real text data set. By comparing the clustering results of different clustering algorithms, the efficiency of the MNC algorithm and the utility of the PyTCL library are verified.
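The MNC pipeline described in contribution 1 can be illustrated with a minimal Python sketch. This is not the thesis's implementation or the PyTCL API; it is an illustration under stated assumptions. Euclidean distance is assumed as the edge weight (the abstract does not name the similarity measure), the threshold is taken as the midpoint between the two centroids of a 1-D 2-means on the MST edge weights, and noise filtering is assumed to mean discarding components smaller than a hypothetical `min_size`. All function names are hypothetical.

```python
import math
from collections import Counter


def mst_edges(points):
    """Prim's algorithm on the complete graph; returns [(weight, i, j), ...]."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge into each vertex
    parent = [-1] * n
    if n:
        best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])  # assumed edge weight
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def two_means_threshold(weights, max_iter=100):
    """1-D k-means with k=2; returns the midpoint between the two centroids."""
    lo, hi = min(weights), max(weights)
    if lo == hi:
        return hi  # all edges equal: nothing will be pruned
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        small = [w for w in weights if w <= mid]
        large = [w for w in weights if w > mid]
        new_lo, new_hi = sum(small) / len(small), sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def mnc_sketch(points, min_size=2):
    """Cluster labels per point; -1 marks noise (assumed: tiny components)."""
    n = len(points)
    edges = mst_edges(points)
    threshold = two_means_threshold([w for w, _, _ in edges]) if edges else 0.0

    root = list(range(n))  # union-find over the edges that survive pruning

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for w, i, j in edges:
        if w <= threshold:          # keep short edges, prune long ones
            root[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(n))
    ids, labels = {}, []
    for i in range(n):
        r = find(i)
        if sizes[r] < min_size:
            labels.append(-1)
        else:
            labels.append(ids.setdefault(r, len(ids)))
    return labels


squares = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
print(mnc_sketch(squares))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Only the data points are passed in the call above, in keeping with the non-parameterized idea. The `min_size` default exists only because the abstract does not specify how noise is filtered. Note also that a single extreme outlier can dominate the 2-means split of the edge weights, so this simplified thresholding should not be taken as representative of the thesis's results.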
Keywords/Search Tags: non-parameterized clustering, K-means algorithm, minimum spanning tree, one-dimensional weight space, text clustering