Font Size: a A A

Research On Clustering Algorithm In Text Mining

Posted on:2010-04-28Degree:MasterType:Thesis
Country:ChinaCandidate:Y TanFull Text:PDF
GTID:2178360272996865Subject:Software engineering
Abstract/Summary:PDF Full Text Request
In real life, mass data are produced continually in many fields. How to extract andmine useful information and knowledge have become more and more important problem.Text Mining is a kind of computer technology which can extract valuable information andknowledge from text data. It also is a branch of Data Mining.With the technology of TextClustering, a great deal of information will be divided into significant classes or clustersduring the process of Text Mining. Therefore Text Clustering has been a core technology inText Mining. Text Clustering is one of the chief means of Text Mining. It can be used inmining theWeb pages for it can further cluster a large number of Web pages searched bySearch Engines according to their page contents. It should be said that mining data ofInternet has become an important research field of Text Mining.We mainly study two parts content in this paper. One is the comparative study onclustering algorithms based on experiments. The other is design and realization of the textclustering system based on the Marked Corpus of People Daily.The detailed tasks are as follows.(1) First, we introduce the purpose of this paper and the background of Text Miningand Text Clustering. And then we introduce the basic theories of clustering, the generalmodeling procedures and the key technologies.(2) Comparative study on clustering algorithms based on experiments is carried on.Besides text preprocess, the choice of clustering algorithm is also a vital step in TextClustering. There are four normal clustering algorithms in Text Clustering, such asK-means based on distance, hierarchical method, DBSCAN based on density and FCMfuzzy clustering algorithm. In this paper, we test these four algorithms by the samedata set, and analyze the test results with tables and curve figures. Then we adjust theparameters of each algorithm which influence the clustering results. After severalexperiments, we find that: (a) the clustering result of K-means algorithm is influenced bythe parameters K, and the parameter of random seed is proportional to the iteration times. (b)the clustering result of hierarchical method is influenced by the parameter Cluster Cutoff,the smaller is the parameter, the more is the noisy data separated. (c) DBSCAN algorithmbased on density is closely linked with the two parameters Minpts and Eps. If Eps'sthreshold value is too small, a cluster will be divided into several little clusters; ifEps's threshold value is too large, several clusters apart with a relatively long distancewill be merged. (d) considering the two parameters the smallest Fc(U) and the largest Dc(U) together, the fuzzy clustering algorithm can choose the optimal number ofcluster. At last, four clustering algorithms are evaluated regarding to their relevantperformances, such as the sensitivity to the input sequence, optimal ability of the clusteringresults, and adaptive capacity to dynamic data, and so on.(3) Design and realization of the text clustering system based on the Marked Corpus ofPeople Daily.(a) Design of the text clustering model. The first step breaking is to statistics the termfrequency for the input documents which have been processed by term -down. At the sametime, we must filter the stop words irrespective with content and low frequency words. Thetext information is expressed as feature vector based on Vector Space Model (VSM), inwhich the TF-IDF weight calculation and the cosine similarity are used together. Whilecomparing two documents, the cosine similarity is applied to judge the similarity betweenthem. After that, cluster the documents with standard K-means algorithm, and output theclustering results.(b) There are two parts in the designed experiments of this paper: In the first part ofthe experiments, the parameter K of K-means algorithm is set to four different values. Afterclustering, the clustering results will be compared with the results of manual classification.In the second part, experiments are designed to verify the influence on the clustering resultscaused by the random seeds and the sequence of the input documents. These 200 pieces ofdocuments used in the two parts of the experiments are all come from the Marked Corpusof People's Daily.The experiment results of the first part show that, when the parameter K is set to be 2to 3, the documents can be roughly distinguished from each other, but the effect is notobvious and the clustering purity is only 0.325 and 0.39. When the parameter K is set to be4, and the number of input is equal to the number of output class, we get the highestaccuracy 0.515, and the iteration times is reduced to 6, meanwhile the running time is alsoshorten. However, not all input documents are mapped to the corresponding cluster, andonly about half of the documents are correctly clustered; when the parameter K is set to be5, with the number of cluster increasing, the set of input documents is divided into smallersubsets, and the accuracy of clustering declines. The purity is only 0.425 and the iterationtimes raises to 9. After repeated experiments and comparison, we make a choice that theoptimal cluster number is 4.In the second part, two experiments about the random seed show that changes of theparameters in random function don't influence the accuracy of clustering, but influence theiteration times and clustering runtime. The larger is the parameter in random function, thelonger is the runtime. Therefore we draw a conclusion that Random (1) is more appropriateto the random function in K-means algorithm, because it neither influences the accuracy ofclustering, nor increases the system runtime. Regarding to the influence on clusteringresults caused by the sequence of input documents, we find the sequence of input documents does not influence the accuracy of the clustering after comparing three groups ofexperiments. So K-means algorithm is insensitive to the sequence of input documents.As can be seen from the above experiments, evaluation for the clustering results willeventually guide the user to adjust the model and parameters and thereby to select theoptimal cluster model and the best parameters.In this paper, we design and realize the text clustering model based on the MarkedCorpus of People Daily. This model can also be used forWeb documents clustering. Suchas it can further cluster theWeb pages searched by Search Engine to improve the efficiencyof searching the target content. Moreover, it can also cluster the Web pages that usersinterested in to find user's interesting pattern which can be used in information filtering andother services. Therefore, this model has a certain applied value.Due to the limited time and the level of the author, related research work involved inthis paper is not perfect and needs further study. In the future work, two aspects should beconsidered: Firstly, the clustering algorithm of the existing model needs to be improved, inorder to improve the accuracy of clustering. Or other clustering algorithm can be appliedinto this system, such as the hierarchical algorithm. Secondly, the text clustering modelproposed in this paper needs to be completed. It can be expanded into a clustering systembased on Web documents, which will play an active role in improving the efficiency of Webinformation retrieval.
Keywords/Search Tags:text mining, clustering algorithm, text clustering, K-means algorithm
PDF Full Text Request
Related items