Document Clustering Based On Hybrid Text

Posted on: 2017-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: H R Lin
Full Text: PDF
GTID: 2348330482986925
Subject: Computer software and theory
Abstract/Summary:
Document clustering is the process of grouping similar texts into a set of clusters. K-means is a widely used partitioning clustering method, owing to its efficiency and simplicity in clustering large data sets. However, it clusters poorly on a hybrid text set, that is, one containing both long and short texts. The K-means algorithm requires every extracted feature vector to have the same dimension, which introduces redundancy for short texts and serious information loss for long texts. In addition, isolated points (outliers) in the text set and initial centroids that lead to a local optimum also degrade the clustering result. To solve these problems, three improved algorithms are proposed: a K-means clustering algorithm based on hybrid text, an improved algorithm for distance-based outlier detection, and an improved algorithm for distance-based optimization of the initial centroids. Comparative document clustering experiments show that the improved K-means clustering algorithm based on hybrid text resolves the problem of clustering hybrid text sets; it improves clustering performance and also shows an advantage in speed. The improved algorithm for distance-based outlier detection removes the requirement that the number of outliers be supplied before clustering: it outputs the outliers accurately without being given their number, and it also estimates the outlying degree of each outlier. The improved algorithm for distance-based optimization of the initial centroids addresses the choice of initial centroids when the density distribution is uneven, and it outputs suitable initial centroids for text sets with an uneven density distribution.
Keywords/Search Tags: hybrid text, K-means, document clustering, text fingerprint
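
The abstract does not spell out the improved algorithms, so the following is only a minimal sketch of the two distance-based ideas it mentions, using standard stand-in techniques rather than the thesis's own methods. It assumes documents have already been turned into fixed-length numeric vectors. Outliers are scored by their mean distance to the k nearest neighbours and flagged when the score exceeds the mean score plus two standard deviations (an assumed rule that avoids supplying the number of outliers in advance). Initial centroids are chosen by farthest-first traversal (an assumed substitute for the thesis's centroid optimization). All function names and the toy data are hypothetical.

```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def flag_outliers(points, k=2, n_std=2.0):
    """Return indices whose score exceeds mean + n_std * std of all scores.

    The threshold rule is an assumption made for this sketch; it means the
    caller does not have to say how many outliers to expect.
    """
    scores = knn_outlier_scores(points, k)
    threshold = statistics.mean(scores) + n_std * statistics.pstdev(scores)
    return [i for i, s in enumerate(scores) if s > threshold]


def farthest_first_centroids(points, n_clusters):
    """Pick initial centroids by farthest-first (maximin) traversal."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Start from the point closest to the overall mean.
    centroids = [min(points, key=lambda p: euclidean(p, mean))]
    while len(centroids) < n_clusters:
        # Next centroid: the point farthest from its nearest chosen centroid.
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


# Toy data: two tight groups plus one isolated point (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]

outlier_idx = flag_outliers(points, k=2)
inliers = [p for i, p in enumerate(points) if i not in outlier_idx]
initial_centroids = farthest_first_centroids(inliers, n_clusters=2)
print(outlier_idx, initial_centroids)
```

Removing the flagged points before choosing centroids keeps an isolated point from being selected as a seed, which is the usual reason for running outlier detection ahead of initialization. The resulting centroids could then be passed to any K-means implementation as its starting centres.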