Research on Text Clustering Based on Improved BIRCH

Posted on: 2014-02-26  Degree: Master  Type: Thesis
Country: China  Candidate: X F Yang  Full Text: PDF
GTID: 2248330398956990  Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development of the Internet, the amount of data on the network has grown ever larger, and most of it is in the form of text. How to process this huge amount of data effectively and extract useful information from it has become an urgent problem. Text mining, as an important line of research addressing this problem, has attracted more and more attention and has become a hot research field within data mining. Text data differs from numerical data and is more complicated, which makes the relevant research more difficult. Within text mining, text classification depends on sample data, so it is not easy to obtain good results with it in practical applications. As an unsupervised counterpart to text classification, text clustering is becoming a popular area of text mining research.

Text clustering is the application of clustering algorithms to text processing. Its core is the clustering algorithm, which is also the focus of this study. Based on a thorough study of the technical process of text clustering, and in order to improve the clustering effect and remedy the flaws of traditional clustering algorithms in parameter setting and algorithm stability, a new text clustering algorithm, TCBIBK (a Text Clustering algorithm Based on Improved BIRCH and K-nearest neighbor), is presented. TCBIBK uses the BIRCH clustering algorithm as its prototype. During clustering, besides analyzing the distance between text objects and clusters, TCBIBK also analyzes the distance between clusters, actively merges or splits clusters, and sets a dynamic threshold. Combined with the KNN classification algorithm, TCBIBK improves algorithm stability while preserving good clustering efficiency. When applied to text clustering, TCBIBK can improve the clustering effect.

Finally, the improved algorithm is implemented in the Java programming language, and text clustering experiments are designed on text sets of different sizes. The experimental results are compared with those of the traditional K-means algorithm and of the Chameleon algorithm, a hierarchical method that performs well in clustering. The results show that this algorithm can greatly improve the validity and stability of text clustering.
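
For readers unfamiliar with the BIRCH prototype mentioned in the abstract, the following is a minimal Java sketch of the standard BIRCH clustering feature (a count, a linear sum, and a sum of squares) together with a threshold test for absorbing a document vector and a centroid distance between two clusters. It is not the TCBIBK implementation: the class name `ClusteringFeature`, its method names, and the fixed `threshold` parameter are assumptions made for illustration. The thesis's dynamic threshold, its active merging and splitting rules, and the KNN step are not reproduced here.

```java
/**
 * Illustrative clustering feature (CF) as used in standard BIRCH.
 * All names here are hypothetical; this is not the thesis's code.
 */
final class ClusteringFeature {
    private int n;               // number of vectors absorbed so far
    private final double[] ls;   // linear sum of the vectors, per dimension
    private double ss;           // sum of squared norms of the vectors

    ClusteringFeature(int dimensions) {
        this.ls = new double[dimensions];
    }

    /** Adds one document vector (for example, a term-weight vector). */
    void add(double[] v) {
        if (v.length != ls.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        for (int i = 0; i < ls.length; i++) {
            ls[i] += v[i];
            ss += v[i] * v[i];
        }
        n++;
    }

    /** Merges another cluster into this one; CFs are additive. */
    void merge(ClusteringFeature other) {
        if (other.ls.length != ls.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        for (int i = 0; i < ls.length; i++) {
            ls[i] += other.ls[i];
        }
        ss += other.ss;
        n += other.n;
    }

    int size() {
        return n;
    }

    double[] centroid() {
        if (n == 0) {
            throw new IllegalStateException("empty cluster has no centroid");
        }
        double[] c = new double[ls.length];
        for (int i = 0; i < ls.length; i++) {
            c[i] = ls[i] / n;
        }
        return c;
    }

    /** Radius: root mean squared distance of members from the centroid. */
    double radius() {
        if (n == 0) {
            return 0.0;
        }
        double centroidNormSq = 0.0;
        for (double s : ls) {
            double m = s / n;
            centroidNormSq += m * m;
        }
        // Clamp at zero to guard against tiny negative rounding errors.
        return Math.sqrt(Math.max(0.0, ss / n - centroidNormSq));
    }

    /**
     * Absorbs v only if the resulting radius stays within the threshold.
     * The threshold is fixed by the caller (an assumption of this sketch).
     */
    boolean tryAbsorb(double[] v, double threshold) {
        ClusteringFeature trial = new ClusteringFeature(ls.length);
        trial.merge(this);
        trial.add(v);
        if (trial.radius() > threshold) {
            return false;
        }
        add(v);
        return true;
    }

    /** Euclidean distance between the centroids of two clusters. */
    static double centroidDistance(ClusteringFeature a, ClusteringFeature b) {
        double[] ca = a.centroid();
        double[] cb = b.centroid();
        double sum = 0.0;
        for (int i = 0; i < ca.length; i++) {
            double d = ca[i] - cb[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

In this sketch a caller would offer each new vector to the nearest cluster with `tryAbsorb` and start a new cluster when it returns `false`. A cluster-to-cluster check of the kind the abstract describes could compare `centroidDistance` against a merge threshold before calling `merge`, though the actual criteria used by TCBIBK are not given in the abstract.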
Keywords/Search Tags: Text clustering, Vector space model, BIRCH, K-nearest neighbor, F1-measure