
Document Clustering Algorithm Based On Multi-coupled Relation Analysis

Posted on: 2019-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: X X Chen
Full Text: PDF
GTID: 2428330566996005
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, the volume of data is exploding. Large numbers of documents are constantly emerging, and cluster analysis can extract a great deal of useful information from them. How to cluster documents efficiently and apply the results to text mining and information retrieval has therefore become an urgent problem. Document clustering partitions a collection of documents into clusters by means of a clustering algorithm, so that highly similar documents fall in the same cluster and dissimilar documents fall in different clusters. It is an important topic in data mining and natural language processing. Most document clustering methods represent documents with the bag-of-words model. Representing a document this way, however, ignores the latent relations between words, so the clustering quality of these methods is unsatisfactory. Some document clustering methods do consider the coupled relations between words, but not comprehensively. This paper studies these coupling relations more comprehensively.

Based on the coupling relations between terms, this paper proposes three clustering methods. First, we present an approach based on WordNet and multi-coupled relation analysis. It uses the WordNet dictionary to calculate the semantic similarity between words and, at the same time, calculates explicit and implicit coupled relations from the frequencies of words. Second, because the original CRM method computes the correlation only indirectly, this paper proposes a clustering method based on JS divergence. This method uses the JS divergence to calculate the intra-relation directly and then carries out document clustering together with the inter-relation. Finally, the first two methods improve the clustering effect at the cost of more complex computation, and they calculate term weights inaccurately. A simplified coupled-relation document clustering algorithm based on self-information and location word frequency is therefore proposed. This method replaces the original TF-IDF weighting with weighting based on self-information and position coupling, simplifies the complex computation of implicit coupling, and improves the efficiency of document clustering.

The three proposed methods are verified experimentally. Each method is combined with k-means and with DBSCAN and tested on two data sets, which indicates that the methods are generally applicable. Before clustering, WordNet, JS divergence, self-information and location are used to make document processing and calculation more accurate. All three methods are compared with the existing coupled methods using four clustering evaluation indexes: Purity, RI, F1 and NMI. Experimental results show that the proposed methods achieve better clustering results.
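As an illustration of two quantities named in the abstract, the sketch below computes the JS (Jensen-Shannon) divergence between two discrete distributions and the self-information of an event with a given probability, using their standard textbook definitions. It is only a minimal sketch: the abstract does not say how the thesis builds the distributions or combines these values into intra-relations or term weights, so the helper names and the co-occurrence counts used here are hypothetical.

```python
import math


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute 0 by convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by [0, 1]."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def self_information(prob):
    """Self-information -log2(prob) in bits of an event with probability prob."""
    if not 0 < prob <= 1:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical co-occurrence counts of two words over the same four contexts
# (an assumption for illustration, not data from the thesis).
word_a = normalize([4, 3, 2, 1])
word_b = normalize([1, 2, 3, 4])
divergence = js_divergence(word_a, word_b)

# A term occurring in 1 of 8 documents carries 3 bits of self-information.
rare_term_bits = self_information(1 / 8)
```

A smaller divergence means the two words are distributed more similarly over the contexts, and a rarer term receives a larger self-information value. Turning either quantity into a similarity or a weight for clustering requires further design choices that the abstract does not specify.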
Keywords/Search Tags: document clustering, multi-coupled relation, WordNet, JS divergence, self-information, location