Research On The Key Technology Of Text Clustering

Posted on:2015-04-05

Degree:Master

Type:Thesis

Country:China

Candidate:C L Wang

GTID:2298330431482860

Subject:Computer system architecture

Abstract/Summary:

With the continuous development of the Internet, the volume of text on the network has grown explosively, and finding, organizing, and using the useful information hidden in this mass of text has become a serious problem. As a preprocessing step in natural language processing, text clustering combines techniques from information retrieval and data mining. As the starting point of text mining, it has a significant impact on the validity and accuracy of later text analysis, and in recent years it has become a research hotspot. Classical text clustering algorithms can be divided into partitioning (division) methods, hierarchical methods, grid-based methods, density-based methods, and model-based methods. For time-consuming applications such as large-scale text processing, partitioning methods have lower computational complexity and are relatively widely applied. Partitioning methods include K-means, K-prototypes, K-medoids, and others, of which K-means is one of the more commonly used.

This paper first briefly introduces the background of text mining, discusses the domestic state of research in the field, and summarizes current research results. It then analyzes the technologies related to text clustering in depth, briefly introduces several representative text clustering algorithms, and focuses on the traditional K-means algorithm, which is widely used in text clustering applications. However, the algorithm is very sensitive to isolated samples, and because its initial cluster centers are selected randomly, it can need more iterations, fall into a local optimum, and produce unstable clustering results. To solve these problems, an initial cluster center selection algorithm for K-means based on the Latent Dirichlet Allocation (LDA) model is proposed.

In the improved algorithm, the top-m most important topics of the text corpus are selected first. The corpus is then preliminarily clustered on these m topic dimensions, which yields m cluster centers; these centers are then used as the starting point for clustering over all dimensions of the corpus. In theory, the center of each cluster can thus be determined based on probability rather than selected at random. Experiments show that the improved algorithm produces more accurate clustering results in fewer iterations. Finally, the paper discusses trends in text clustering models and the outlook for open challenges in the field.
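
The seeding idea described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation; the abstract does not give the exact procedure, so several details here are assumptions: the document-topic matrix `theta` is taken as already produced by some LDA model, topic "importance" is taken to be total probability mass over the corpus, the preliminary clustering assigns each document to its dominant topic among the top m, and all function names and the toy data are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_topics(theta, m):
    """Indices of the m topics with the largest total probability mass.

    Assumption: 'importance' of a topic = sum of its weight over all documents.
    """
    n_topics = len(theta[0])
    mass = [sum(row[t] for row in theta) for t in range(n_topics)]
    return sorted(range(n_topics), key=lambda t: mass[t], reverse=True)[:m]


def lda_seeded_centers(docs, theta, m):
    """Preliminary clustering on the top-m topic dimensions.

    Each document joins the group of its dominant topic among the top m;
    the mean of each group in the full feature space becomes an initial center.
    """
    topics = top_topics(theta, m)
    groups = {t: [] for t in topics}
    for vec, row in zip(docs, theta):
        dominant = max(topics, key=lambda t: row[t])
        groups[dominant].append(vec)
    # Topics that attracted no documents contribute no center.
    return [mean(groups[t]) for t in topics if groups[t]]


def kmeans(docs, initial_centers, max_iter=100):
    """Plain K-means (Lloyd's algorithm) started from the given centers."""
    centers = [list(c) for c in initial_centers]  # do not mutate the input
    labels = None
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new_labels = [
            min(range(len(centers)), key=lambda k: sq_dist(d, centers[k]))
            for d in docs
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = mean(members)
    return labels, centers, iterations


# Toy data (hypothetical): 4 documents as term-weight vectors, and their
# topic distributions over 3 topics as an LDA model might produce them.
docs = [
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
theta = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
]

seeds = lda_seeded_centers(docs, theta, m=2)
labels, centers, n_iter = kmeans(docs, seeds)
print(labels, n_iter)
```

Because the seeds already sit near the topic-based groups, K-means on this toy data only needs to confirm the assignments. In practice `theta` could come from any LDA implementation, and the full-space vectors would typically be TF-IDF weights; both choices are outside what the abstract specifies.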

Keywords/Search Tags:

topic model, K-means, cluster center, text clustering, LDA

Related items

1	A Research Of Developed Algorithms About Text Cluster Center Choose
2	Research On Text Clustering And Its Application In Topic Detection Analysis
3	Research On The Topic Clustering Of Network Short Text
4	Research On Text Clustering Based On Division And Hierarchy
5	Analysis Of Network Public Opinion Data Based On Short Text Clustering
6	Design And Implementation Of Distributed Text Clustering System Based On K-means
7	Improvement Of K-Means Algorithm And Its Application In Weibo Topic Discovery
8	Research And Implementation Of Text Clustering Based On Dk-means
9	Research And Implementation Of Text Clustering Based On DK-Means
10	Research On Keyword Extraction Algorithm For Chinese Texts And Cluster Center Point Selection Algorithm In Text Clustering