Document Clustering Based On Hybrid Text

Posted on: 2017-05-10

Degree: Master

Type: Thesis

Country: China

Candidate: H R Lin

GTID: 2348330482986925

Subject: Computer software and theory

Abstract/Summary:

Document clustering is the process of grouping similar texts into a set of clusters. K-means is a widely used partitioning clustering method because of its efficiency and simplicity on large data sets. However, it clusters poorly on a hybrid text set, that is, one containing both long and short texts. K-means requires every extracted feature vector to have the same dimension, which introduces redundancy for short texts and serious information loss for long texts. In addition, isolated points in the text set and initial centroids that lead to a local optimum also degrade the clustering result. To address these problems, three improved algorithms are proposed: a K-means clustering algorithm based on hybrid text, an improved distance-based outlier detection algorithm, and an improved distance-based algorithm for optimizing the initial centroids. Comparative document clustering experiments show that the hybrid-text K-means algorithm resolves the problem of clustering hybrid text sets, improving clustering performance while also running faster. The improved outlier detection algorithm removes the requirement that the number of isolated points be supplied before clustering: it outputs the outliers accurately without that number and also estimates the outlying degree of each outlier. The improved initial-centroid algorithm addresses the choice of initial centroids when the density distribution is uneven, producing suitable initial centroids for text sets with uneven density.
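The abstract does not give the details of the proposed algorithms, so the following is only a generic sketch of the two distance-based ideas it mentions, not the thesis's method. It assumes documents have already been mapped to fixed-length numeric vectors. The function names, the k-nearest-neighbour score, the median-based threshold, and the farthest-first seeding rule are all illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_outlier_scores(points, k=2):
    """Outlying degree of each point: mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def detect_outliers(points, k=2, factor=3.0):
    """Flag points whose score exceeds factor * median score.

    The number of outliers is not an input; it falls out of the threshold.
    Both k and factor are assumed tuning parameters.
    """
    scores = knn_outlier_scores(points, k)
    median = sorted(scores)[len(scores) // 2]
    flagged = [i for i, s in enumerate(scores) if s > factor * median]
    return flagged, scores


def farthest_first_centroids(points, n_clusters):
    """Greedy distance-based seeding for K-means (a stand-in technique).

    Start from the point nearest the overall mean, then repeatedly add the
    point whose distance to its closest chosen centroid is largest.
    """
    dim = len(points[0])
    mean = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    centroids = [min(points, key=lambda p: euclidean(p, mean))]
    while len(centroids) < n_clusters:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


# Toy data: two tight groups plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
outlier_idx, scores = detect_outliers(points)
inliers = [p for i, p in enumerate(points) if i not in outlier_idx]
seeds = farthest_first_centroids(inliers, n_clusters=2)
```

Removing the flagged points before seeding matters here: farthest-first selection is drawn toward extreme points, so an isolated point left in the data would likely be chosen as a centroid.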

Keywords/Search Tags:

hybrid text, K-means, document clustering, text fingerprint

Related items

1	Based On K-means The Chinese Text Clustering Algorithm
2	High performance text document clustering
3	The implementation of dynamic document organization using the integration of text clustering and text categorization
4	Text Clustering Based On K-means Algorithm And Realization
5	Chinese Text Clustering Based On Text Similarity
6	The Research And Application Of Text Clustering Based On Improved K-means Algorithm
7	Chinese Text Clustering Algorithm Based On Suffix Tree Research
8	Research On Text Clustering Based On Swarm Intelligence Algorithm
9	Design And Implementation Of Distributed Text Clustering System Based On K-means
10	Based On The Text Of The K-means Clustering Analysis