Research On Text Clustering Algorithm Based On 2 Degree Frequent Word Sequence

Posted on:2010-02-19

Degree:Master

Type:Thesis

Country:China

Candidate:W C Ma

Full Text:PDF

GTID:2178360275956566

Subject:Applied Mathematics

Abstract/Summary:

With the development of information technology, text in electronic form has become the main source from which people acquire knowledge, and text mining has become a tool for obtaining these resources and this knowledge. Because it requires no class labels, text clustering has become a research focus. Traditional clustering algorithms have not solved three problems: high data dimensionality, low clustering accuracy, and the lack of an understandable description of the clusters. Frequent term-based text clustering addresses these problems. Frequent Itemset-based Hierarchical Clustering provides a hierarchical topic structure that is easy to browse, but frequent word sets sometimes do not represent text documents well. Text clustering based on frequent word sequences (CFWS) represents documents better than frequent word sets and improves clustering accuracy, but it produces many overlaps between clusters.

The main work of this thesis is as follows:

1. CFWS represents texts by frequent word sequences but does not try to place each text in the most appropriate cluster, so clusters may overlap heavily. This thesis assigns each document to an appropriate cluster by verifying the 2 degree frequent word sequences of the cluster label against the text, which remedies this disadvantage of CFWS and improves clustering precision.

2. Representing a text with the Vector Space Model loses the semantic information between features. The thesis proposes the 2 degree frequent word sequence and constructs a new text representation model from 2 degree frequent word sequences. This model preserves the order of words and expresses semantic information well.

3. Building on the theoretical improvement of the algorithm, the thesis compares the experimental results of CFWS and the proposed method. The results show that the improved algorithm resolves the excessive overlap between clusters produced by CFWS and that, measured both by traditional cluster evaluation criteria and by clustering precision, it improves the clustering result to a certain extent.
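
The abstract does not define the 2 degree frequent word sequence formally, so the following Python sketch is only an illustration under stated assumptions: a 2 degree word sequence is taken to be an ordered pair of distinct words that occur in that order within a small window, it is "frequent" when it appears in at least `min_support` documents, and a document is assigned to the single cluster whose label pairs it matches most often. The names `ordered_pairs`, `frequent_pairs`, `assign`, `window`, and `min_support` are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def ordered_pairs(tokens, window=3):
    """Ordered pairs (a, b) where b follows a within `window` positions (assumed definition)."""
    pairs = set()
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            if a != b:
                pairs.add((a, b))
    return pairs


def frequent_pairs(docs, min_support=2, window=3):
    """Pairs that occur in at least `min_support` documents."""
    counts = Counter()
    for tokens in docs:
        # Each document contributes a pair at most once (document frequency).
        counts.update(ordered_pairs(tokens, window))
    return {pair for pair, count in counts.items() if count >= min_support}


def assign(tokens, cluster_labels, window=3):
    """Return the one cluster whose label pairs match the document best, or None."""
    doc_pairs = ordered_pairs(tokens, window)
    best, best_score = None, 0
    for name, label_pairs in cluster_labels.items():
        score = len(doc_pairs & label_pairs)
        if score > best_score:  # ties keep the earlier cluster
            best, best_score = name, score
    return best


# Toy corpus, invented for illustration only.
docs = [
    "text clustering uses frequent word sequence".split(),
    "frequent word sequence improves text clustering".split(),
    "stock market prices fall".split(),
    "stock market prices rise".split(),
]
frequent = frequent_pairs(docs, min_support=2)

# Hypothetical cluster labels built from frequent pairs.
cluster_labels = {
    "clustering": {("frequent", "word"), ("word", "sequence")},
    "finance": {("stock", "market"), ("market", "prices")},
}
chosen = assign("frequent word sequence mining".split(), cluster_labels)
```

Because every document receives at most one cluster, this kind of verification step removes overlap by construction. Unlike a bag-of-words vector, the pair `("frequent", "word")` is distinct from `("word", "frequent")`, which is how word order is preserved in this sketch.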

Keywords/Search Tags:

text clustering, 2 degree frequent word sequence, cluster description

Related items

1	Study And Implementation Of Frequent Closed Word Sequence Set Based Hierarchical Clustering Algorithm
2	The Research Of Text Clustering Based On Frequent Selected Word Set
3	Research On Short Text Clustering And Cluster Description Method
4	Research On Text Clustering Of Micro-blog Public Opinion: Word Sense Cluster And Collocation-Based Method
5	Research On Web Document Clustering Based On Sentential Maximum Frequent Word Sets
6	Research On Microblog Hot Topic Discovery Technology Based On Frequent Word Sets
7	Short Text Clustering Method Based On BTM
8	Message Text Clustering Based On Frequent Patterns
9	Text Clustering Method Based On Frequent Itemsets
10	Clustering Algorithm Research Of Short Text Based On Semantic Similarity