
Research On The Key Technology Of Text Clustering

Posted on: 2015-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: C L Wang
GTID: 2298330431482860
Subject: Computer system architecture
Abstract/Summary:
With the continuous development of the Internet, the volume of text on the network has grown explosively, and finding, organizing, and using the useful information hidden in this mass of text has become a serious problem. As a preprocessing step in natural language processing, text clustering combines techniques from information retrieval and data mining. As the starting point of text mining, it has a significant impact on the validity and accuracy of subsequent text analysis, and in recent years it has become a research hotspot. Classical text clustering algorithms can be divided into partitioning, hierarchical, grid-based, density-based, and model-based methods. For time-consuming applications such as large-scale text processing, partitioning methods have lower computational complexity and are relatively widely applied. Partitioning methods include K-means, K-prototypes, K-medoids, and others, of which K-means is among the most commonly used.
This thesis first gives a brief introduction to text mining, discusses the domestic state of research in the field, and summarizes current research results. It then analyzes the technology related to text clustering in depth, briefly introduces several representative text clustering algorithms, and focuses on the traditional K-means algorithm, which is widely used in text clustering applications. However, the algorithm is very sensitive to isolated samples, and because its initial cluster centers are selected at random it tends to need more iterations, fall into local optima, and produce unstable clusterings. To solve these problems, an initial cluster center selection algorithm for K-means based on the Latent Dirichlet Allocation (LDA) model is proposed.
In the improved algorithm, the top-m most important topics of the text corpus are selected first. The corpus is then preliminarily clustered on these m topic dimensions. This yields m cluster centers, which serve as the starting point for clustering over all dimensions of the corpus. In principle, the center of each cluster can thus be determined from probabilities rather than selected at random. Experiments show that the improved algorithm produces more accurate clusterings in fewer iterations. Finally, the thesis outlines trends in text clustering models and the outlook for open challenges in the field.
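The seeding idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how topic importance is measured or how the preliminary clustering is done, so the sketch assumes that importance is a topic's total probability mass over the corpus and that each document is preliminarily assigned to its strongest topic among the top m. The document-topic matrix `doc_topic` is assumed to come from an already-fitted LDA model; here it and the term-weight vectors `doc_vectors` are hand-written toy values, and all function names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def top_m_topics(doc_topic, m):
    """Return indices of the m topics with the largest total mass (assumed criterion)."""
    n_topics = len(doc_topic[0])
    mass = [sum(row[t] for row in doc_topic) for t in range(n_topics)]
    return sorted(range(n_topics), key=lambda t: mass[t], reverse=True)[:m]


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def initial_centers(doc_vectors, doc_topic, m):
    """Preliminary clustering on the top-m topics, then centroids in the full space."""
    topics = top_m_topics(doc_topic, m)
    groups = {t: [] for t in topics}
    for vec, row in zip(doc_vectors, doc_topic):
        strongest = max(topics, key=lambda t: row[t])  # assumed assignment rule
        groups[strongest].append(vec)
    # Topics that attracted no document yield no center.
    return [mean(g) for g in groups.values() if g]


def kmeans(doc_vectors, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [list(c) for c in centers]  # do not mutate the caller's list
    labels = None
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new_labels = [
            min(range(len(centers)), key=lambda k: dist(v, centers[k]))
            for v in doc_vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [v for v, lab in zip(doc_vectors, labels) if lab == k]
            if members:
                centers[k] = mean(members)
    return labels, centers, iterations


# Toy corpus: six documents over four terms (made-up weights).
doc_vectors = [
    [1.0, 0.9, 0.0, 0.1], [0.9, 1.0, 0.1, 0.0], [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.9], [0.1, 0.0, 0.9, 1.0], [0.0, 0.0, 1.0, 1.0],
]
# Made-up document-topic probabilities for three topics.
doc_topic = [
    [0.80, 0.10, 0.10], [0.70, 0.20, 0.10], [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10], [0.20, 0.70, 0.10], [0.10, 0.80, 0.10],
]

centers = initial_centers(doc_vectors, doc_topic, m=2)
labels, final_centers, iterations = kmeans(doc_vectors, centers)
print(labels, iterations)
```

Because the seeds are derived from the topic structure rather than drawn at random, repeated runs on the same input give the same clustering, which is the stability property the abstract attributes to the improved algorithm.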
Keywords/Search Tags:topic model, K-means, cluster center, text clustering, LDA