Text Clustering Based On Frequent Word Sets And Complex Networks

Posted on:2020-02-04

Degree:Master

Type:Thesis

Country:China

Candidate:M Chen

Full Text:PDF

GTID:2370330599452930

Subject:engineering

Abstract/Summary:

Social networking has gone mainstream, and channels such as microblogs, WeChat and headline news provide abundant text resources. As a result, web text mining, and text clustering in particular, is increasingly needed and valued. This thesis focuses on text clustering. Traditional clustering methods are based on the vector space model (VSM). Because the number of texts runs into the millions, the traditional VSM in most cases yields text representations with excessively high dimensionality and sparsity. To address these problems, this thesis first introduces the concept of the "Frequent Term Set" (FTS) from data mining. Representing texts by their frequent term sets achieves dimensionality reduction. The thesis then introduces a second concept, the "complex network". Under this concept, the original text collection is modeled as a text network, so that the relations between texts are no longer one to one but many to many. Text clustering based on the complex network model can therefore reflect the relations and similarities between texts better than traditional clustering. A community discovery algorithm divides the complex text network into communities, and each community represents one cluster in the clustering process. Traditional community discovery algorithms are generally based on graph partitioning theory or on modularity optimization. These methods have several disadvantages, such as high complexity and double counting. This thesis therefore introduces the probabilistic algorithm model from machine learning and proposes an improved k-means algorithm based on DPCA for community discovery. DPCA is used to determine both the initial central nodes and the value of K for k-means, and the density of data points in DPCA is replaced by the degree of nodes. Finally, experimental comparison shows that the proposed algorithm clusters text better than traditional text clustering methods.
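The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes DPCA refers to density peaks clustering, uses Jaccard similarity over frequent term sets with an arbitrary threshold to build the text network, takes K as a given parameter instead of deriving it, and performs only the seeding step plus one nearest-center assignment, without the iterative k-means updates. All function names and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_term_sets(docs, min_support=2, max_size=2):
    """Term sets (frozensets) that occur in at least min_support documents."""
    counts = Counter()
    for terms in docs:
        unique = sorted(set(terms))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= min_support}

def represent(docs, fts):
    """Represent each document by the frequent term sets it contains."""
    return [{s for s in fts if s <= set(terms)} for terms in docs]

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_network(features, threshold=0.2):
    """Link documents whose similarity reaches the (assumed) threshold."""
    adj = {i: set() for i in range(len(features))}
    for i, j in combinations(range(len(features)), 2):
        if jaccard(features[i], features[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def pick_centers(features, adj, k):
    """Density-peak-style seeding with node degree standing in for density."""
    n = len(features)
    rho = {i: len(adj[i]) for i in range(n)}
    # Rank by degree (ties broken by index) so every node except the
    # first has at least one "denser" node to measure against.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = {}
    for pos, i in enumerate(order):
        denser = order[:pos]
        if denser:
            delta[i] = min(1 - jaccard(features[i], features[j]) for j in denser)
        else:
            delta[i] = 1.0  # simplification: largest possible Jaccard distance
    return sorted(range(n), key=lambda i: (-(rho[i] * delta[i]), i))[:k]

def assign(features, centers):
    """Label each document with its most similar center (one pass only)."""
    return [max(centers, key=lambda c: jaccard(f, features[c])) for f in features]

docs = [
    ["stock", "market", "price"],
    ["stock", "market", "trade"],
    ["stock", "price", "trade"],
    ["football", "match", "goal"],
    ["football", "goal", "team"],
    ["football", "match", "team"],
]
fts = frequent_term_sets(docs)
features = represent(docs, fts)
adj = build_network(features)
centers = pick_centers(features, adj, k=2)
labels = assign(features, centers)
print(centers, labels)  # [0, 3] [0, 0, 0, 3, 3, 3]
```

On this toy input, each document is described by 5 frequent term sets drawn from a pool of 14, rather than by a full vocabulary vector, and the two seeds fall in the two topical groups.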

Keywords/Search Tags:

Frequent Term Set, dimensionality reduction, complex network, community discovery algorithm, text clustering

Related items

1	Short Text Clustering Based On Frequent Word Co-occurrence Network
2	Research And Application Of Community Discovery Based On Markov Clustering Algorithm
3	Research On Community Discovery Based On Text Attribute Information
4	Microblog Community Discovery Algorithm Based On Spark
5	Research On Community Discovery Algorithm Based On Spectral Clustering
6	Research On The Community Discovery Algorithm Based On Tightness
7	Research On Community Structure Detection Algorithms In Complex Networks
8	Research On Community Discovery Algorithm For Large-scale Complex Networks
9	Research On Community Discovery Method Based On Autoencoder
10	Research And Implementation Of Dynamic Overlapping Community Discovery Algorithm