The Research And Application In Text Clustering Of K-Means Algorithm

Posted on: 2014-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: B L Chen
Full Text: PDF
GTID: 2248330398979121
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, storing large amounts of textual information has become easier, and the number of documents available on the Web is growing rapidly. While the amount of usable information keeps growing, users' ability to understand and manage it remains unchanged. Problems such as how to find the items of interest in so much information, and how to categorize unclassified text, have given rise to a new research direction: text mining. Text clustering, one of the most important branches of text mining, is a method for discovering the category structure of a text corpus. It divides the documents into categories according to a similarity metric, so that the documents within each class are highly similar, and it gives a corresponding summary description for each category. Compared with clustering ordinary experimental data, text clustering has its own characteristics, so it is a challenging field for researchers. Research on and applications of the K-Means algorithm are currently increasing, especially in text clustering.

This thesis first introduces the basic theory of cluster analysis and text clustering, then proposes an improved method that addresses the limitations of the K-Means algorithm, and finally applies the improved K-Means algorithm to text clustering.

First, the thesis summarizes the background of clustering algorithms and text clustering research, and the achievements at home and abroad. The field is well developed abroad, while domestic work is still at the theoretical research stage.
Then, the theory of data mining is briefly introduced, including the concept and steps of data mining.

After introducing the theory of cluster analysis, including the concept of clustering and clustering algorithms, the thesis explains the K-Means algorithm and its advantages and disadvantages in detail. To address problems in the original algorithm, such as the impact of isolated points and the selection of the initial cluster centers, an improved K-Means algorithm is proposed and outlier analysis is introduced. The outlier analysis uses the statistical idea that a data point is isolated when the absolute value of its Z-score (standard score) is greater than 2. This method has a rigorous mathematical basis and also spares the user from having to set a threshold. The strategy for determining the initial cluster centers is to split off the relatively concentrated data first at each step, which ensures strong similarity among the samples of each class. Outlier detection reduces the influence of outliers on the clustering. At the same time, the initial-center selection strategy in the improved K-Means algorithm not only reduces the chance that the algorithm falls into a local optimum but also, to some extent, reduces the number of iterations. The improved algorithm is then tested on the iris data set. Compared with the original algorithm, the results show that the effectiveness and performance of the improved K-Means algorithm are greatly increased.

Next, the thesis introduces the concept and process of text mining and implements an example text clustering application based on the improved K-Means algorithm. The application contains three modules: a text preprocessing module, a clustering module, and a performance evaluation module. Detailed design ideas and a brief code structure are given for each module.
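The Z-score rule for isolated points described above can be sketched as follows. This is only an illustration, not the thesis's implementation: the abstract does not say which quantity the Z-score is computed on, so the sketch assumes it is each point's Euclidean distance to the overall centroid. The function names and the sample data are hypothetical.

```python
import statistics


def zscore_outlier_mask(values, threshold=2.0):
    """Return True for each value whose |Z-score| exceeds the threshold."""
    values = list(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values are identical, so nothing can be isolated.
        return [False] * len(values)
    return [abs((v - mean) / std) > threshold for v in values]


def distances_to_centroid(points):
    """Euclidean distance of each point to the centroid (assumed criterion)."""
    dim = len(points[0])
    centroid = [statistics.fmean([p[i] for p in points]) for i in range(dim)]
    return [
        sum((p[i] - centroid[i]) ** 2 for i in range(dim)) ** 0.5
        for p in points
    ]


# Hypothetical data: six points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0),
          (0.9, 0.8), (1.0, 1.2), (9.0, 9.0)]
mask = zscore_outlier_mask(distances_to_centroid(points))
inliers = [p for p, is_outlier in zip(points, mask) if not is_outlier]
```

The points left in `inliers` would then be passed to K-Means, so that the isolated point does not pull a cluster center toward itself.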
In the implementation, a "trade space for time" performance optimization is proposed for computing TF-IDF values in the data preprocessing module, and the thesis gives its own method for calculating accuracy in the performance evaluation module. The application is then run on the text classification corpus from the Sogou laboratory, and the text clustering results are shown.

Finally, the thesis is summarized, some questions left for later resolution are identified, and future research directions for cluster mining are put forward.
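One way to read "trade space for time" in the TF-IDF computation is to store per-document term counts and corpus-wide document frequencies once, rather than rescanning the corpus for every term. The sketch below illustrates that idea under stated assumptions: the abstract does not give the thesis's data structures or its exact TF-IDF formula, so the weighting used here (term count divided by document length, times log(N / df)) and the name `build_tfidf` are hypothetical.

```python
import math
from collections import Counter


def build_tfidf(docs):
    """docs: list of token lists. Returns one {term: tf-idf} dict per document."""
    # Space spent here: one Counter per document plus a document-frequency table.
    term_counts = [Counter(doc) for doc in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # each document counts once per term

    # IDF is computed once per term and then reused for every document.
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    vectors = []
    for doc, counts in zip(docs, term_counts):
        total = len(doc)
        if total == 0:
            vectors.append({})
            continue
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


# Hypothetical tokenized documents.
docs = [["a", "b", "a"], ["b", "c"], ["b"]]
vectors = build_tfidf(docs)
```

With the cached tables, each document's vector is built from lookups only, at the cost of holding the counts and the IDF table in memory.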
Keywords/Search Tags: K-Means algorithm, data preprocessing, text clustering mining