| With the rapid development of the Internet and intranets, the volume and variety of electronic text data have increased sharply. How to access, manage, and use these texts quickly and efficiently has become an urgent and important issue in the field of information systems. In recent years, automatic content-based text clustering, one of the basic tools for solving these problems, has developed at an unprecedented pace and attracted widespread attention. The goal of text clustering is to divide the documents in a collection into several clusters such that the similarity between documents within the same cluster is as high as possible while the similarity between documents in different clusters is as low as possible. As an important application of text mining, text clustering has become an active research topic. This paper first introduced the background and significance of text mining research and the related basic theory. Second, it analyzed the text preprocessing pipeline, focusing on word segmentation for Chinese text. 
It adopted the maximum matching algorithm for word segmentation, combined with one-word backtracking and a word-frequency-based method to detect and resolve segmentation ambiguity. It discussed feature representation and feature selection for the preprocessed text, using the Vector Space Model (VSM) to represent texts and the TF-IDF weighting function to select text features. Then, for Chinese text clustering, it used a two-pass clustering method based on the k-means algorithm. In the first pass, it applied k-means to the texts, choosing the value of k from a given range so as to maximize the average silhouette coefficient and selecting the initial centers with a method based on sample density. Experiments showed that both methods for determining the initial parameters are feasible. In the second pass, if a cluster produced by the first pass contained far more samples than the other clusters, that cluster was clustered again. Finally, this paper designed a text clustering system and tested the effect of two-pass clustering on Chinese text. The test results show that, as an experimental system, its main performance indicators are basically satisfactory. |
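
As a rough illustration of the two-pass procedure summarized above, the following Python sketch picks k by maximizing the average silhouette coefficient and then re-clusters a cluster that is much larger than the rest. It is not the system described in the paper. The abstract does not specify the density-based selection of initial centers, so a deterministic farthest-first seeding is used as a stand-in. The `ratio` threshold that decides when a cluster is "much larger", every function name, and the use of plain numeric tuples in place of TF-IDF vectors of segmented Chinese text are likewise assumptions made for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first(points, k):
    """Stand-in seeding (NOT the paper's density-based method): start at
    points[0], then repeatedly add the point farthest from chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster lost all of its members.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels

def mean_silhouette(points, labels):
    """Average silhouette coefficient; -1.0 if fewer than two clusters are used."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def choose_k(points, k_range):
    """Pick the k (each candidate must be >= 2) with the highest mean silhouette."""
    return max(k_range, key=lambda k: mean_silhouette(points, kmeans(points, k)))

def two_pass_cluster(points, k_range, ratio=3.0):
    """First pass: k-means with silhouette-selected k. Second pass: re-cluster
    the largest cluster if it exceeds `ratio` times the mean size of the others
    (the threshold is an assumption; the abstract only says 'much higher')."""
    k = choose_k(points, k_range)
    labels = kmeans(points, k)
    sizes = [labels.count(c) for c in range(k)]
    big = max(range(k), key=lambda c: sizes[c])
    other_mean = (len(points) - sizes[big]) / (k - 1)
    sub_idx = [i for i, lab in enumerate(labels) if lab == big]
    sub_ks = [c for c in k_range if c < len(sub_idx)]
    if sizes[big] > ratio * other_mean and sub_ks:
        sub = [points[i] for i in sub_idx]
        sub_labels = kmeans(sub, choose_k(sub, sub_ks))
        for i, s in zip(sub_idx, sub_labels):
            if s > 0:  # sub-cluster 0 keeps the old label, the rest get new ones
                labels[i] = k + s - 1
    return labels

# Toy data: two nearby groups of four points and one distant pair.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (100, 0), (101, 0)]
labels = two_pass_cluster(points, range(2, 5))
```

On this toy data the distant pair dominates the silhouette score, so the first pass merges the two nearby groups into one oversized cluster. The second pass then separates them, which is the situation the re-clustering step is meant to handle.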