| With the development of information technology, text in electronic form has become the main source from which people obtain knowledge, and text mining has become a key tool for extracting that knowledge. Because it requires no class labels, text clustering has become a research focus. Traditional clustering algorithms have not solved three problems: high data dimensionality, low clustering accuracy, and the lack of an understandable description of the clusters. Frequent term-based text clustering addresses these problems. Frequent Itemset-based Hierarchical Clustering provides a hierarchical topic structure that is easy to browse, but frequent word sets sometimes represent text documents poorly. Text clustering based on frequent word sequences (CFWS) represents documents better than frequent word sets and improves clustering accuracy, but it produces heavy overlap between clusters. The main work of this paper is as follows: 1. CFWS represents texts by frequent word sequences but does not try to place each text in its most appropriate cluster, so clusters may overlap heavily. This paper assigns each document to an appropriate cluster by verifying the 2-degree frequent word sequences of the cluster label against the text, which remedies this shortcoming of CFWS and improves clustering precision. 2. Representing a text with the Vector Space Model loses the semantic information between features. This paper proposes the 2-degree frequent word sequence and constructs a new text representation model from such sequences; the model preserves word order and expresses semantic information well. 3. Building on these theoretical improvements to the algorithm, this paper compares the experimental results of CFWS and the proposed method. The results show that the improved algorithm resolves the excessive cluster overlap produced by CFWS and that, measured by both traditional cluster evaluation criteria and clustering precision, it improves the clustering result to a certain extent. |
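
The abstract does not define a 2-degree frequent word sequence precisely, so the following is only a minimal sketch under stated assumptions: a 2-degree sequence is taken to be an ordered pair of adjacent words, it is "frequent" when it occurs in at least `min_support` documents, and a document is assigned to the single cluster whose label pairs it covers best. The function names, the `min_support` parameter, and the coverage score are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def word_pairs(tokens):
    """Return the set of ordered adjacent word pairs in a token list.

    Assumption: a "2-degree word sequence" is a pair of adjacent words.
    """
    return {(a, b) for a, b in zip(tokens, tokens[1:])}


def frequent_pairs(docs, min_support):
    """Return the pairs that occur in at least `min_support` documents."""
    counts = Counter()
    for tokens in docs:
        # word_pairs returns a set, so each pair counts once per document.
        counts.update(word_pairs(tokens))
    return {pair for pair, n in counts.items() if n >= min_support}


def assign(tokens, cluster_labels):
    """Assign a document to the one cluster whose label it covers best.

    cluster_labels maps a cluster name to its set of label pairs.
    The score is the fraction of label pairs found in the document
    (a hypothetical verification rule). Returns None if nothing matches.
    """
    pairs = word_pairs(tokens)
    best, best_score = None, 0.0
    for name, label in cluster_labels.items():
        if not label:
            continue
        score = len(label & pairs) / len(label)
        if score > best_score:
            best, best_score = name, score
    return best


# Toy corpus, already tokenized.
docs = [
    "text clustering uses frequent word sequences".split(),
    "frequent word sequences improve text clustering".split(),
    "vector space model loses word order".split(),
]

# Hypothetical cluster labels expressed as sets of word pairs.
labels = {
    "sequences": {("frequent", "word"), ("word", "sequences")},
    "vsm": {("vector", "space"), ("space", "model")},
}

print(frequent_pairs(docs, min_support=2))
print([assign(d, labels) for d in docs])
```

Because `assign` returns exactly one cluster per document, no document can belong to two clusters, which is the overlap reduction the abstract describes. Keeping the pairs ordered is what preserves word order, unlike a bag-of-words vector.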