Chinese Text Clustering Algorithm Based On Suffix Tree Research

Posted on: 2006-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: L H Lu
Full Text: PDF
GTID: 2208360182456267
Subject: Computer application technology
Abstract/Summary:
Text mining is the discovery of implicit, useful, and interesting patterns and knowledge in large collections of text documents or corpora. Text mining techniques make it possible to process large stores of text resources in bulk, and such processing offers considerable potential in fields such as information retrieval. This thesis deals with text clustering, which is an important approach to text mining and an important branch of data mining, with emphasis on Chinese text clustering based on the suffix tree. The suffix tree is a data structure that was first introduced to support string matching and queries, for example finding the longest repeated substring, matching similar strings, and comparing strings. Suffix Tree Clustering (STC) is a method that treats a text as a string of phrases rather than as a collection of words. It can therefore use the similarity information between phrases to achieve better clustering. STC has already been applied successfully in some areas of English text clustering, and this thesis is devoted to applying STC to Chinese text clustering. The thesis emphasizes the techniques and theory of data mining and focuses in particular on Chinese text clustering.
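The phrase-sharing idea behind STC can be illustrated with a small sketch. It is not the thesis's implementation: it enumerates short phrases directly instead of building a suffix tree, it assumes each Chinese character is one token (the thesis may use word segmentation instead), and the names `base_clusters`, `merge_clusters`, `max_len`, and `threshold` are hypothetical. The merge rule, joining phrase groups whose document sets overlap by more than half in both directions, follows the commonly described STC procedure.

```python
from collections import defaultdict
from itertools import combinations

def base_clusters(docs, max_len=3, min_docs=2):
    """Map each phrase (tuple of tokens) shared by >= min_docs documents to those documents.

    Brute-force stand-in for the suffix tree: every phrase of up to
    max_len tokens is enumerated explicitly (an assumption for brevity).
    """
    phrase_docs = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for start in range(len(tokens)):
            for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
                phrase_docs[tuple(tokens[start:end])].add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

def merge_clusters(clusters, threshold=0.5):
    """Merge base clusters whose document sets overlap by more than threshold both ways."""
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        shared = len(clusters[a] & clusters[b])
        if shared / len(clusters[a]) > threshold and shared / len(clusters[b]) > threshold:
            parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= clusters[p]
    return sorted(sorted(d) for d in set(map(frozenset, merged.values())))

# Hypothetical toy corpus; each character is treated as one token.
docs = [list("文本聚类算法"), list("文本聚类系统"), list("后缀树结构"), list("后缀树算法")]
result = merge_clusters(base_clusters(docs))
```

On this toy corpus, document 0 ends up in two clusters, one through the shared phrase 文本聚类 and one through 算法. Overlapping clusters of this kind are something a phrase-based method allows and k-means does not.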
The thesis covers the following main aspects:
(1) Research on text clustering algorithms, especially the k-means algorithm and its application to Chinese texts.
(2) A study of Chinese text clustering models suited to the characteristics of Chinese texts.
(3) An in-depth study and test of the feasibility of applying the suffix tree technique to Chinese text clustering.
(4) The design and implementation of a Chinese text clustering system that supports clustering with both the k-means and STC algorithms.
(5) Experiments on several groups of Chinese text data sets that compare the k-means and STC algorithms, yielding some valuable results that are explained theoretically and demonstrated. The problems that occurred in the experiments are discussed and a direction for future research is presented.
Lu Lihua (Computer Application Technology), directed by Prof. Gao Maoting...
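For comparison with STC, the k-means baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of standard Lloyd-style k-means, not the thesis's system: the term-count vectors, the fixed initial centroids, and the function name `kmeans` are all assumptions made for the example.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with squared Euclidean distance; returns (labels, centroids)."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(list(old))  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical term-count vectors over an assumed four-term vocabulary.
points = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
labels, centers = kmeans(points, centroids=[points[0], points[2]])
```

Unlike the phrase-based approach, each document here receives exactly one label, and word order within a document plays no role in the vectors.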
Keywords/Search Tags: Text Mining, Text Clustering, K-means, STC