Research On Text Clustering Based On Division And Hierarchy

Posted on:2013-02-13

Degree:Master

Type:Thesis

Country:China

Candidate:Y M Liu

Full Text:PDF

GTID:2248330371470080

Subject:Computer software and theory

Abstract/Summary:

As more and more useful information takes the form of text, clustering and classifying large-scale text collections quickly has become increasingly important. Automatic text clustering and automatic text classification emerged to address this need. Text clustering combines machine learning with statistical methods to divide texts into categories; it handles the organization of massive amounts of text information well and has been widely used. Current text clustering research focuses mainly on text representation and on clustering algorithms. On the representation side, the feature space obtained after preprocessing is generally sparse and high-dimensional, which lowers both the quality and the efficiency of clustering. On the algorithm side, text clustering mainly relies on K-Means, K-Medoids, CURE, BIRCH, DBSCAN, and similar algorithms; how to improve these algorithms so that they better serve the quality and efficiency of text clustering is also an active topic.

The introduction of this thesis briefly presents the background, basic theory, and research progress of text clustering. The thesis then describes the basic ideas, types, and theoretical basis of the clustering algorithms used in text clustering, the evaluation criteria for text clustering algorithms, the commonly recognized data sets, and the key techniques used in the text clustering process. Building on an analysis of existing text clustering research, the thesis studies two questions: first, how to improve text clustering algorithms so as to improve clustering performance; second, how to verify the effectiveness of the improved algorithms through comparative experiments.

The main contributions of this thesis are as follows:

1. K-Means is one of the most commonly used text clustering algorithms. Its complexity is relatively low and it is simple to implement, but it has an obvious shortcoming: it is overly sensitive to the choice of initial cluster centers, which makes the clustering results unstable. To address this shortcoming, the thesis presents an initial cluster center selection method based on a variable threshold. When selecting initial cluster centers, a sample is chosen as the next initial center if its distance from the initial centers already selected is greater than an adjustable threshold. The threshold is adjusted according to the number of initial cluster centers it yields until the threshold condition is met, after which clustering is carried out. Experimental results on 10 UCI data sets and four text data sets show that the proposed algorithm performs better than the K-Means algorithm.

2. Hierarchical cluster analysis is an important research topic in data mining and pattern recognition and has broad application prospects. Inspired by the selection of the best classification attribute in decision tree learning, the thesis proposes a novel hierarchical clustering algorithm that uses information gain. Information gain is used to weight the attributes of the samples, which guides the original hierarchical clustering algorithm and further improves the quality of the clustering results. Experimental results on 10 UCI data sets and four text data sets show that the proposed algorithm performs better than the original hierarchical clustering algorithm.
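The variable-threshold selection of initial cluster centers described in the first contribution might look roughly like the following. This is a minimal sketch and not the thesis's implementation: the function name `select_initial_centers`, the `shrink` factor, the choice of the first sample as the first center, and the rule of only lowering the threshold when too few centers are found are all assumptions made for illustration.

```python
import math


def select_initial_centers(points, k, threshold, shrink=0.8, max_rounds=200):
    """Pick k initial centers that lie farther than `threshold` from each other.

    Assumed adjustment rule (not from the thesis): if a pass over the data
    yields fewer than k centers, the threshold is multiplied by `shrink`
    and the pass is repeated.
    """
    for _ in range(max_rounds):
        centers = [points[0]]  # assumption: the first sample seeds the search
        for p in points[1:]:
            if len(centers) == k:
                break
            # Accept p only if it is far from every center chosen so far.
            if all(math.dist(p, c) > threshold for c in centers):
                centers.append(p)
        if len(centers) == k:
            return centers, threshold
        threshold *= shrink  # too few centers: relax the threshold
    raise ValueError("could not find k sufficiently separated centers")


# Hypothetical toy data: three well-separated pairs of points.
points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0), (10, 0.2)]
centers, final_threshold = select_initial_centers(points, k=3, threshold=20.0)
print(centers, final_threshold)
```

The returned centers could then be passed to any standard K-Means routine as its starting points, in place of randomly chosen ones.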
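The idea of weighting attributes by information gain before hierarchical clustering can be sketched as follows. The abstract does not say how information gain is computed in a clustering setting, so this sketch assumes that reference class labels are available for the samples and that attributes are categorical; the function names, the normalization of the weights, and the use of average linkage are likewise illustrative assumptions, not the thesis's algorithm.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Entropy of the labels minus their entropy after splitting on `column`."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def feature_weights(rows, labels):
    """Normalize per-attribute information gains so that they sum to 1."""
    gains = [information_gain([r[j] for r in rows], labels)
             for j in range(len(rows[0]))]
    total = sum(gains)
    if total == 0:  # no attribute is informative: fall back to equal weights
        return [1 / len(gains)] * len(gains)
    return [g / total for g in gains]


def weighted_distance(a, b, weights):
    """Euclidean distance with each squared difference scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def agglomerative(rows, weights, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of row indices."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        total = sum(weighted_distance(rows[i], rows[j], weights)
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


# Hypothetical toy data: attribute 0 tracks the label, attribute 1 is noise.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["a", "a", "b", "b"]
weights = feature_weights(rows, labels)
clusters = agglomerative(rows, weights, n_clusters=2)
print(weights, clusters)
```

In this toy case the uninformative attribute receives zero weight, so the merge order is driven entirely by the informative attribute.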

Keywords/Search Tags:

variable threshold, K-Means, initial cluster center, information gain, hierarchical clustering

Related items

1	Differentially Private K-means Clustering
2	Research On The Selection Of Initial Cluster Centers In K-means Algorithm
3	The Study And Development Of Hierarchical-K-means-Based Clustering Algorithm
4	Ksummary Analysis Method Based On Adaptive Multiple Clustering
5	Research On Initial Cluster Centers Choice Algorithm And Clustering For Imbalanced Data
6	Research On Problems Related To The Initial Center Selection In K-means Clustering Algorithm
7	The Research Of The K-means Clustering Algorithm Based On Nearest Neighbors
8	Research On Improvement Of K-means Algorithm For Micro-blogging Information
9	Improved K-means Algorithm Based On Optimizing Initial Cluster Centers
10	Improvements And Implementation Of K-means Clustering Algorithm