
Text Clustering Based On Frequent Word Sets And Complex Networks

Posted on: 2020-02-04  Degree: Master  Type: Thesis
Country: China  Candidate: M Chen  Full Text: PDF
GTID: 2370330599452930  Subject: Engineering
Abstract/Summary:
Social networking has gone mainstream, and channels such as microblogs, WeChat, and headline news provide abundant text resources. Web text mining, and text clustering in particular, is therefore increasingly needed and valued. This thesis studies text clustering. Traditional clustering methods are based on the vector space model (VSM). Because the number of texts runs into the millions, the traditional VSM in most cases produces text representations with excessively high dimensionality and sparsity. To address these problems, this thesis introduces the concept of the "Frequent Term Set" (FTS) from data mining. Representing texts by FTS achieves dimensionality reduction.

The thesis then introduces a second concept, the "complex network". Under this concept, the original text set is modeled as a text network, so the relations between texts are no longer one-to-one but many-to-many. Text clustering based on the complex network model can therefore reflect the relations and similarities among texts better than traditional clustering. A community discovery algorithm divides the text network into communities, and each community represents one cluster in the clustering process.

Traditional community discovery algorithms are generally based on graph partitioning theory and modularity optimization. These methods have disadvantages such as high complexity and repeated computation. This thesis therefore introduces a probabilistic algorithm model from machine learning and proposes an improved k-means algorithm based on DPCA for community discovery. DPCA is used to determine the initial central nodes and the value of K for k-means, and the density of data points in DPCA is replaced by the degree of nodes. Experimental comparison shows that the algorithm used in this thesis achieves better text clustering results than the traditional text clustering method.
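
The pipeline described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation; it only shows one plausible reading of the steps under several assumptions. Frequent term sets are counted by brute force rather than with a dedicated mining algorithm, two texts are linked when they share at least one frequent term set, hop count stands in for distance between nodes, and a single nearest-center assignment pass replaces the full iterative k-means. The function names and the thresholds `min_support`, `min_shared`, `min_degree`, and `min_delta` are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations

def frequent_term_sets(docs, min_support, max_size=2):
    """Term sets of size 1..max_size found in at least min_support documents."""
    counts = Counter()
    for terms in docs:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(terms), size):
                counts[frozenset(combo)] += 1
    return [s for s, c in counts.items() if c >= min_support]

def build_network(docs, term_sets, min_shared=1):
    """Assumed edge rule: link texts sharing >= min_shared frequent term sets."""
    covers = [{s for s in term_sets if s <= doc} for doc in docs]
    adj = {i: set() for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        if len(covers[i] & covers[j]) >= min_shared:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def hop_distances(adj, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def select_centers(adj, min_degree=1, min_delta=2):
    """Density-peak style selection with node degree used as the density.

    delta(u) is the hop distance to the nearest node of higher degree
    (ties broken in favour of the lower index). Nodes with no reachable
    denser node get delta = number of nodes. The number of centers, K,
    falls out of the two thresholds.
    """
    all_dist = {u: hop_distances(adj, u) for u in adj}
    centers = []
    for u in adj:
        rank_u = (len(adj[u]), -u)
        denser = [v for v in adj if (len(adj[v]), -v) > rank_u]
        reachable = [all_dist[u][v] for v in denser if v in all_dist[u]]
        delta = min(reachable) if reachable else len(adj)
        if len(adj[u]) >= min_degree and delta >= min_delta:
            centers.append(u)
    return centers, all_dist

def assign(adj, centers, all_dist):
    """One k-means-like assignment pass: nearest center by hop distance, -1 if none."""
    labels = {}
    for u in adj:
        options = [(all_dist[c][u], c) for c in centers if u in all_dist[c]]
        labels[u] = min(options)[1] if options else -1
    return labels

# Toy corpus (hypothetical): each text is already reduced to a set of terms.
docs = [
    {"graph", "network", "community"},
    {"graph", "network", "node"},
    {"network", "community", "graph"},
    {"kmeans", "cluster", "center"},
    {"kmeans", "cluster", "density"},
    {"cluster", "center", "kmeans"},
]
term_sets = frequent_term_sets(docs, min_support=2)
network = build_network(docs, term_sets)
centers, all_dist = select_centers(network)
labels = assign(network, centers, all_dist)
print(centers, labels)
```

In this sketch each text is labeled with the index of its center node, so the number of distinct labels plays the role of K. A fuller version would iterate the center update and assignment steps until the labels stop changing.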
Keywords/Search Tags: Frequent Term Set, dimensionality reduction, complex network, community discovery algorithm, text clustering