
Research On Text Clustering Based On Division And Hierarchy

Posted on: 2013-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Liu
GTID: 2248330371470080
Subject: Computer software and theory
Abstract/Summary:
As more and more useful information now takes the form of text, clustering and classifying large-scale text collections quickly is becoming increasingly important. Automatic text clustering and automatic text classification emerged to address this problem. Text clustering combines machine learning with statistical theory to divide texts into different categories; it offers a good way to organize massive amounts of text information and has been widely used. Current text clustering research focuses mainly on text representation and on clustering algorithms. On the representation side, the representation space obtained after preprocessing is generally sparse and high-dimensional, which degrades both the quality and the efficiency of text clustering. On the algorithm side, the main text clustering algorithms include K-Means, K-Medoids, CURE, BIRCH and DBSCAN, among others; how to improve these algorithms so that they deliver better text clustering quality and efficiency is also an active research topic. In this paper, the introduction briefly presents the background, basic theory and research progress of text clustering, followed by the basic ideas, types and theoretical foundations of the clustering algorithms used in text clustering.
The paper also describes the evaluation criteria for text clustering algorithms and the commonly recognized data sets, and introduces the key techniques used in the text clustering process. Then, on the basis of a study and analysis of existing text clustering research and its current state, this paper addresses two questions: first, how to improve text clustering algorithms so as to improve clustering performance; second, how to verify, by means of comparative experiments, that the improved clustering algorithms are effective on text. The work proposed in this paper is as follows:
1. K-Means is one of the most commonly used text clustering algorithms. Its complexity is relatively low and it is simple to implement, but it has an obvious shortcoming: it is too sensitive to the choice of initial cluster centers, which makes the clustering results unstable. To address this shortcoming, this paper presents an initial center selection method based on a variable threshold. When selecting initial cluster centers, a sample whose distance from the initial cluster centers already chosen is greater than a variable threshold is selected as the next initial cluster center. The threshold is adjusted according to the number of initial cluster centers obtained under it until the threshold condition is met, and the clustering operation is then carried out. Experimental results on 10 UCI data sets and four text data sets show that the proposed algorithm performs better than the K-Means algorithm.
2. Hierarchical cluster analysis is a very important research topic in data mining and pattern recognition, with broad application prospects. Inspired by the selection of the best classification attribute in decision tree learning, this paper proposes a novel hierarchical clustering algorithm that uses information gain.
Sample attributes are weighted according to their information gain, and these weights guide the original hierarchical clustering algorithm toward higher-quality clustering results. Experimental results on 10 UCI data sets and four text data sets show that the proposed algorithm performs better than the original hierarchical clustering algorithm.
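The variable-threshold center selection in item 1 can be sketched roughly as follows. The abstract does not give the exact procedure, so several details here are assumptions for illustration only: the first sample seeds the centers, a sample must be farther than the threshold from every center already chosen, the threshold starts at the largest distance from the first sample, and it is shrunk by a fixed factor (the hypothetical parameter `shrink`) until at least `k` centers are found. The function names are also hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_centers(samples, threshold):
    """One greedy pass: a sample becomes a center if it lies farther than
    `threshold` from every center chosen so far (assumed reading of the abstract)."""
    centers = [samples[0]]  # assumption: the first sample seeds the centers
    for s in samples[1:]:
        if all(euclidean(s, c) > threshold for c in centers):
            centers.append(s)
    return centers


def variable_threshold_centers(samples, k, shrink=0.9, max_rounds=200):
    """Lower the threshold until the greedy pass yields at least k centers.

    The multiplicative update rule is an assumption, not the thesis's rule.
    Returns the first k centers and the threshold that produced them.
    """
    threshold = max(euclidean(samples[0], s) for s in samples)
    for _ in range(max_rounds):
        centers = pick_centers(samples, threshold)
        if len(centers) >= k:
            return centers[:k], threshold
        threshold *= shrink
    raise ValueError("could not find k sufficiently separated centers")


# Three well-separated pairs of points; one center should come from each pair.
points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0), (10, 0.1)]
centers, final_threshold = variable_threshold_centers(points, k=3)
```

The centers returned this way would then be passed to an ordinary K-Means run as its starting points, in place of random initialization.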
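The information-gain weighting in item 2 might look like the sketch below. The abstract does not say how the gain is computed for unlabelled data or which linkage is used, so this sketch makes loud assumptions: class labels are available for the samples used to compute weights (as they are in labelled benchmark data), attributes are discrete, weights are the normalized gains, and clusters are merged by average linkage over a weighted Euclidean distance. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting the labels on a discrete attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def attribute_weights(rows, labels):
    """Normalized information gain per attribute (assumed weighting scheme)."""
    gains = [information_gain([r[j] for r in rows], labels)
             for j in range(len(rows[0]))]
    total = sum(gains)
    if total == 0:  # no attribute is informative: fall back to equal weights
        return [1.0 / len(gains)] * len(gains)
    return [g / total for g in gains]


def weighted_distance(a, b, weights):
    """Euclidean distance with each squared difference scaled by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def agglomerative(rows, weights, k):
    """Average-linkage agglomerative clustering down to k clusters of row indices."""
    clusters = [[i] for i in range(len(rows))]

    def average_link(c1, c2):
        total = sum(weighted_distance(rows[i], rows[j], weights)
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_link(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


# Attribute 0 separates the classes; attribute 1 is noise and gets weight 0.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["a", "a", "b", "b"]
weights = attribute_weights(rows, labels)
clusters = agglomerative(rows, weights, k=2)
```

With the noisy attribute down-weighted, rows that agree on the informative attribute are merged first, which is the intended effect of guiding the hierarchy with information gain.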
Keywords/Search Tags: variable threshold, K-Means, initial cluster center, information gain, hierarchical clustering