Research Of Text Clustering Based On Improved Birch

Posted on:2014-02-26

Degree:Master

Type:Thesis

Country:China

Candidate:X F Yang

GTID:2248330398956990

Subject:Computer application technology

Abstract/Summary:

In recent years, with the rapid development of the Internet, the amount of data on the network has grown ever larger, and most of it is in the form of text. How to process this huge amount of data effectively and extract useful information from it has become an urgent problem. Text mining, as an important line of research addressing this problem, has received more and more attention and has become a hot research field within data mining. Text data differs from numerical data and is more complicated, which makes the relevant research more difficult. In text mining research, text classification depends on sample data, so it is not easy to obtain good results in practical applications. As an unsupervised text classification approach, text clustering is becoming a popular area of text mining research. Text clustering is the application of a clustering algorithm to text processing; the clustering algorithm is its core, and it is also the focus of this study. Based on a thorough study of the technical process of text clustering, a new text clustering algorithm, TCBIBK (a Text Clustering algorithm Based on Improved BIRCH and K-nearest neighbor), is presented in order to improve the effect of text clustering and to remedy the flaws of traditional clustering algorithms in parameter setting and algorithm stability. TCBIBK uses the BIRCH clustering algorithm as its prototype. During clustering, besides analyzing the distance between text objects and clusters, TCBIBK also analyzes the distance between clusters, actively merges or splits clusters, and sets a dynamic threshold. Combined with the KNN classification algorithm, TCBIBK improves the stability of the algorithm while maintaining good clustering efficiency. When applied to text clustering, TCBIBK can improve the clustering effect.
Finally, the improved algorithm is implemented in the Java programming language, and text clustering experiments are designed on text sets of different sizes. The experimental results are compared with those of the traditional K-means algorithm and of the Chameleon algorithm, a hierarchical method that performs well in clustering. The results show that the algorithm can greatly improve the validity and stability of text clustering.
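
The two distance checks mentioned in the abstract (object to cluster, and cluster to cluster) can be illustrated with the clustering feature (CF) summary that standard BIRCH maintains. The following is only a minimal sketch and not the thesis's TCBIBK implementation: it uses a flat list of clusters instead of a CF tree, a fixed threshold instead of a dynamic one, and omits cluster splitting and the KNN step. The names CfSketch, ClusteringFeature, radiusWith, cluster and mergeClose are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CfSketch {

    /** BIRCH clustering feature: (count, linear sum, sum of squared norms). */
    static final class ClusteringFeature {
        int n;
        final double[] linearSum;
        double squareSum;

        ClusteringFeature(int dim) { linearSum = new double[dim]; }

        /** Absorb one vector (for text, e.g. a term-weight vector). */
        void add(double[] x) {
            n++;
            for (int i = 0; i < x.length; i++) {
                linearSum[i] += x[i];
                squareSum += x[i] * x[i];
            }
        }

        /** CF additivity: merging two clusters only adds their summaries. */
        void merge(ClusteringFeature other) {
            n += other.n;
            for (int i = 0; i < linearSum.length; i++) linearSum[i] += other.linearSum[i];
            squareSum += other.squareSum;
        }

        double[] centroid() {
            double[] c = new double[linearSum.length];
            for (int i = 0; i < c.length; i++) c[i] = linearSum[i] / n;
            return c;
        }

        /** Radius the cluster would have after absorbing x: sqrt(SS/N - |LS/N|^2). */
        double radiusWith(double[] x) {
            int m = n + 1;
            double ss = squareSum;
            double centroidNormSq = 0;
            for (int i = 0; i < x.length; i++) {
                double s = linearSum[i] + x[i];
                ss += x[i] * x[i];
                centroidNormSq += (s / m) * (s / m);
            }
            return Math.sqrt(Math.max(0, ss / m - centroidNormSq));
        }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    /** Object-to-cluster step: absorb into the nearest cluster if its radius stays within the threshold. */
    static List<ClusteringFeature> cluster(double[][] points, double threshold) {
        List<ClusteringFeature> clusters = new ArrayList<>();
        for (double[] p : points) {
            ClusteringFeature nearest = null;
            double best = Double.MAX_VALUE;
            for (ClusteringFeature cf : clusters) {
                double d = distance(cf.centroid(), p);
                if (d < best) { best = d; nearest = cf; }
            }
            if (nearest != null && nearest.radiusWith(p) <= threshold) {
                nearest.add(p);
            } else {
                ClusteringFeature cf = new ClusteringFeature(p.length);
                cf.add(p);
                clusters.add(cf);
            }
        }
        return clusters;
    }

    /** Cluster-to-cluster step: repeatedly merge pairs whose centroids are closer than mergeDistance. */
    static void mergeClose(List<ClusteringFeature> clusters, double mergeDistance) {
        boolean merged = true;
        while (merged) {
            merged = false;
            outer:
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = distance(clusters.get(i).centroid(), clusters.get(j).centroid());
                    if (d < mergeDistance) {
                        clusters.get(i).merge(clusters.remove(j));
                        merged = true;
                        break outer;
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        // Toy 2-D vectors standing in for document vectors (illustrative data only).
        double[][] docs = { {0, 0}, {0.1, 0}, {0, 0.1}, {5, 5}, {5.1, 5} };
        List<ClusteringFeature> clusters = cluster(docs, 0.5);
        System.out.println("clusters before merge: " + clusters.size());
        mergeClose(clusters, 10.0);
        System.out.println("clusters after merge: " + clusters.size());
    }
}
```

Because a CF stores only the count, the linear sum and the squared sum, both the absorption test and the merge run without revisiting the original vectors, which is what makes BIRCH-style methods efficient on large text sets.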

Keywords/Search Tags:

Text clustering, Vector space model, BIRCH, K-nearest neighbor, F1-measure

Related items

1	Design And Implementation Of The Technical Text Categorization System
2	Study On Generalized Nearest Neighbor Pattern Classification
3	Research On Text Clustering Algorithm Based On Spectral Clustering
4	The Research And Application Of Clustering Algorithm Based On Density
5	Research On Spatial Queries For Moving Objects In Indoor Space
6	Information Filtering Systems Based On Web Text Content And Design,
7	Design And Realization Of Text Categorization System
8	Researches And Implements Of Text Filtering For Physical GAP
9	Automatic Classification Research On Chinese Web Document Orientation
10	Support Vector Clustering Based On Shared Nearest Neighbor