Study On Text Category Oriented Chinese Text Mining And Its Implementation

Posted on: 2005-01-13
Degree: Master
Type: Thesis
Country: China
Candidate: A H Xu
Full Text: PDF
GTID: 2168360122990520
Subject: Control theory and control engineering
Abstract/Summary:
With the development of the Internet and information technology, more and more information is expressed as text. Obtaining useful information quickly and efficiently from large text collections is becoming increasingly important. Text mining is a new technology that applies data mining methods to retrieve information from text, and it has drawn great interest. Much work has been done on it, but most of that work focuses on English text mining and little attention has been paid to Chinese text mining. In this thesis, we investigate Chinese text mining and, on that basis, implement a Chinese text categorization system.
Chinese word segmentation is both a prerequisite for and a major difficulty in analyzing Chinese text. Building on Dr. Chen Guilin's method, we design a new segmentation algorithm that tags lexicon entries as useful or useless words and builds a two-level index over the Chinese thesaurus; its time complexity is better than that of current algorithms. Using this method, we can extract a small number of composite features that represent the original information well and greatly reduce the dimensionality.
This thesis designs and implements a text classification system and discusses key techniques in its implementation. We adopt the Vector Space Model (VSM) to represent documents and evaluate the classification algorithm with two measures, recall and precision. A mutual information method is adopted for feature extraction. In particular, the support vector machine (SVM) text classification algorithm is discussed: we introduce the linear SVM and the nonlinear SVM and analyze why SVM is theoretically superior to other methods. A series-parallel method is used to tune the parameters, and working-set and buffer techniques are adopted to improve computational efficiency.
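As an illustration of mutual-information feature selection, the following minimal Python sketch scores each (term, category) pair and keeps the highest-scoring terms. It is not the thesis implementation: the estimate MI(t, c) = log(P(t, c) / (P(t) P(c))) from document counts is one common formulation, the function names `mutual_information` and `select_features` are hypothetical, and the documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Score each (term, category) pair that co-occurs in at least one document.

    Assumed estimate: MI(t, c) = log(N * N_tc / (N_t * N_c)), where N is the
    number of documents, N_t the documents containing t, N_c the documents in
    category c, and N_tc the documents in c that contain t.
    """
    n_docs = len(docs)
    cat_count = Counter(labels)
    term_count = Counter()
    joint_count = Counter()
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            term_count[term] += 1
            joint_count[(term, label)] += 1
    return {
        (term, cat): math.log(n_tc * n_docs / (term_count[term] * cat_count[cat]))
        for (term, cat), n_tc in joint_count.items()
    }


def select_features(docs, labels, k):
    """Keep the k terms with the highest MI score over all categories."""
    best = {}
    for (term, _cat), score in mutual_information(docs, labels).items():
        best[term] = max(best.get(term, float("-inf")), score)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


# Toy corpus (hypothetical); tokens stand in for segmented Chinese words.
docs = [
    ["football", "match"],
    ["football", "player"],
    ["stock", "market"],
    ["stock", "match"],
]
labels = ["sports", "sports", "finance", "finance"]
features = select_features(docs, labels, 4)
```

In this toy corpus "match" appears equally often in both categories, so its score is zero and it is the one term dropped when four features are kept.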
The experimental results show that the two measures used to evaluate the classification algorithm, precision and recall, are satisfactory.
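For reference, precision and recall for a single category can be computed as in the minimal Python sketch below. It uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)); the function name `precision_recall` and the sample labels are hypothetical and are not results from the thesis.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Return (precision, recall) for one category.

    precision = TP / (TP + FP): share of documents assigned to the category
    that really belong to it.
    recall = TP / (TP + FN): share of documents in the category that the
    classifier found.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    # Return 0.0 when a denominator is zero rather than raising an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for five documents.
true_labels = ["sports", "sports", "finance", "finance", "sports"]
predicted = ["sports", "finance", "finance", "finance", "sports"]
precision, recall = precision_recall(true_labels, predicted, "sports")
```

Here every document predicted as "sports" is correct, but one of the three true "sports" documents is missed, so precision is 1.0 and recall is 2/3.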
Keywords/Search Tags: Text Mining, Text Categorization, Word Segmentation, Vector Space Model (VSM), Support Vector Machine (SVM)