
Research On Text Clustering Algorithm Based On DBSCAN

Posted on: 2017-08-19
Degree: Master
Type: Thesis
Country: China
Candidate: H C Liu
GTID: 2348330488451592
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet and information technology, a huge amount of information is produced and disseminated online every day. Much of it is Web text, such as news articles and blog posts. Text mining is an important part of data mining: it applies text preprocessing, feature extraction, text clustering and related techniques to extract useful information from massive text data. Text clustering is a key step in this process, and the quality of the clustering algorithm directly affects the quality and effectiveness of information extraction. In recent years, a growing body of research has adapted classical clustering algorithms to text, and this remains an active research topic. However, most current work focuses on transplanting and modifying classical partition-based and hierarchy-based clustering algorithms, for example text clustering based on k-means and on the Single-Pass algorithm. The density-based algorithm DBSCAN (density-based spatial clustering of applications with noise) is not very sensitive to noisy data and can detect clusters of arbitrary shape, characteristics that can yield better text clustering results than other approaches. This thesis studies the characteristics of DBSCAN so that it can be applied effectively to the clustering of text data.

First, by combining DBSCAN with the K-nearest-neighbor method, the thesis presents D-DBSCAN, a text clustering algorithm that determines DBSCAN's input parameters automatically. The algorithm uses the K-nearest-neighbor method to obtain an optimal K value for the data set. From this K value and the distribution of the objects in the feature space, it computes the two parameters DBSCAN requires: the scanning radius and the minimum number of objects a cluster must contain. DBSCAN is then run with these parameters, which are derived from the data set itself, so the initial parameters no longer have to be chosen by experience. Second, the thesis proposes KS-DBSCAN, a text clustering algorithm that optimizes DBSCAN with a K-means strategy and uses a cluster-merging strategy to improve clustering speed.

Finally, to verify the proposed algorithms, the thesis uses news texts from the Sogou corpus as experimental data. After text preprocessing and formalization of the text set, the texts are clustered with the two proposed methods, the k-means method, the Gaussian mean method and the DBSCAN algorithm. The experimental results show that the two proposed algorithms perform better in precision, recall and F-value, and that the second method, KS-DBSCAN, takes less clustering time than DBSCAN. The research indicates that text clustering based on density clustering can provide a more efficient clustering method for the text mining process.
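The idea of deriving DBSCAN's parameters from K-nearest-neighbor distances can be illustrated with a small sketch. This is not the thesis's D-DBSCAN: the abstract does not give its formulas, so the heuristic here (scanning radius = mean cosine distance to the k-th nearest neighbor, minimum objects = k + 1), the function names and the toy term-count vectors are all assumptions made for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; a zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def estimate_eps(vectors, k):
    """Assumed heuristic: mean distance from each point to its k-th nearest neighbor."""
    kth = []
    for i, u in enumerate(vectors):
        dists = sorted(cosine_distance(u, v)
                       for j, v in enumerate(vectors) if j != i)
        kth.append(dists[k - 1])
    return sum(kth) / len(kth)

def dbscan(vectors, eps, min_pts):
    """Plain DBSCAN; returns one label per vector, with -1 marking noise."""
    NOISE = -1
    labels = [None] * len(vectors)  # None means "not visited yet"

    def neighbours(i):
        # The neighbourhood includes the point itself (distance ~0).
        return [j for j, v in enumerate(vectors)
                if cosine_distance(vectors[i], v) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels

# Toy term-count vectors (hypothetical): two topics and one unrelated document.
docs = [
    [5, 1, 0], [4, 1, 0], [6, 1, 0],
    [1, 5, 0], [1, 4, 0], [1, 6, 0],
    [0, 0, 7],
]
k = 2
eps = estimate_eps(docs, k)           # scanning radius derived from the data
labels = dbscan(docs, eps, min_pts=k + 1)
print(eps, labels)
```

In this sketch neither the radius nor the minimum object count is supplied by hand; both follow from k and the data. A mean over k-th neighbor distances is pulled upward by outliers, so on real text collections a more robust statistic may be preferable.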
Keywords/Search Tags:Text mining, text clustering, k-means, DBSCAN