Document Clustering Algorithm Based On Multi-coupled Relation Analysis

Posted on:2019-05-02

Degree:Master

Type:Thesis

Country:China

Candidate:X X Chen

GTID:2428330566996005

Subject:Computer application technology

Abstract/Summary:

With the development of the Internet, the volume of data is growing explosively. Large numbers of documents are constantly emerging, and cluster analysis can extract a great deal of useful information from them. How to cluster documents efficiently and apply the results to text mining and information retrieval has therefore become an urgent problem. Document clustering uses a clustering algorithm to partition a collection of documents into clusters, so that documents with high similarity fall in the same cluster and documents with low similarity fall in different clusters. It is an important topic in data mining and natural language processing.

Most document clustering methods represent documents with the bag-of-words model. This representation does not consider the potential relations between words, so the clustering results of these methods are unsatisfactory. Some document clustering methods do consider the coupled relations between words, but not comprehensively. This thesis studies these coupled relations more comprehensively and, based on the coupled relations between terms, proposes three clustering methods.

First, we present an approach based on WordNet and multi-coupled relation analysis. The approach uses the WordNet dictionary to calculate the semantic similarity between words, and calculates explicit and implicit coupled relations from word frequencies. Second, because the original CRM method calculates the correlation only indirectly, this thesis proposes a clustering method based on JS divergence. The method uses JS divergence to calculate the intra-relation directly and combines it with the inter-relation to cluster documents. Finally, the first two methods improve the clustering effect at the cost of more complex computation, and they calculate weights inaccurately. A simplified coupled-relation document clustering algorithm based on self-information and positional word frequency is therefore proposed. This method replaces TF-IDF weighting with weighting based on self-information and position coupling, simplifies the complex computation of implicit coupling, and improves the efficiency of document clustering.

All three methods are verified experimentally. Each is combined with k-means and with DBSCAN, and experiments on two data sets show that the methods are general. Before clustering, WordNet, JS divergence, self-information, and position information are used to make document processing and calculation more accurate. The three methods are compared with existing coupled methods using four clustering evaluation measures: Purity, RI, F1, and NMI. Experimental results show that the proposed methods achieve better clustering results.
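
The abstract names JS (Jensen-Shannon) divergence as the measure used to compute the intra-relation between words, but it does not give the formula or the distributions being compared. The following is a minimal sketch of the standard JS divergence only, not the thesis's actual method. The function names and the toy data (two terms described by their occurrence counts across four documents) are assumptions made for illustration.

```python
import math


def normalize(counts):
    """Turn raw non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing by convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions.

    JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
    With base-2 logarithms the value lies in [0, 1] and is symmetric.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical data: occurrence counts of two terms across four documents.
term_a = normalize([3, 0, 2, 1])
term_b = normalize([2, 1, 2, 1])

divergence = js_divergence(term_a, term_b)
# One simple (assumed) way to turn the divergence into a similarity score.
similarity = 1.0 - divergence
```

Unlike KL divergence, JS divergence is symmetric and always finite, because each distribution is compared against the mixture `m` rather than directly against the other. This makes it usable as a word-to-word relation even when one term never occurs in a document where the other does.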

Keywords/Search Tags:

document clustering, multi-coupled relation, WordNet, JS divergence, self-information, location

Related items

1	Studies On Semi-supervised Clustering Algorithms Based On Entropy And Divergence
2	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
3	Research On Document Clustering Technology Based On Latent Semantic Indexing
4	Research And Application Of Multi-relation Clustering Algorithm Based On Membrane System
5	A Study Of Chinese Multi-document Summarization Based On Adaptive Clustering Algorithm
6	The Research On Multi-document Summarization Generation Method Based On Text Relation Graph
7	Research On Document-level Long Text Relation Extraction Algorithms
8	Document-level Entity Relation Extraction Based On Document Structure And External Knowledge
9	Researches On Diversity Multi-view Clustering
10	Semantic Hierarchical Clustering Based Multi-document Summarization Research