Research And Implement On The Related Algorithms Of Chinese Text Classification

Posted on: 2008-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: R P Yu
Full Text: PDF
GTID: 2178360215964588
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, and especially the spread of Internet applications, the volume of electronic text has grown enormously. Organizing and processing such large amounts of data, and finding the information users are interested in quickly and accurately, is a major challenge for information science and technology. One efficient way to manage text is automatic text classification, an important intelligent information processing method with great application value in fields such as information filtering, information retrieval, text databases, and digital libraries.

This thesis discusses the applications of text classification in the domains of natural language processing, text mining, machine learning, and pattern recognition, and introduces text classification technology and its related algorithms. A Chinese automatic text classification system is designed and implemented in order to identify the problems and regularities in the algorithms of text classification. The system has a training module and a classifying module. The training module includes:
(1) Chinese text preprocessing. Chinese word segmentation based on the forward maximum matching (FMM) algorithm is implemented, and a practical stop-word dictionary is built through experiment.
(2) Term selection. Five algorithms are implemented: Information Gain, Mutual Information (MI), the χ² statistic, Cross Entropy (CE), and Document Frequency (DF).
(3) Weight computation. Several weighting schemes are implemented, including Term Frequency (TF), TF*IDF, TF*(term's evaluation value), and TF*IDF*(term's evaluation value).
(4) Classification model construction. Three statistical classification algorithms are implemented: class-center classification, Bayes, and KNN.
In the classifying module, unlabeled text is classified by the classification model.
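The FMM segmentation used in the preprocessing step (FMM is commonly expanded as forward maximum matching) can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the thesis's implementation: the function name `fmm_segment`, the toy dictionary, the stop-word set, and the maximum word length of 4 are all hypothetical choices made for illustration.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word that starts there; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary and stop-word list, for illustration only.
dictionary = {"中文", "文本", "自动", "分类", "系统"}
stop_words = {"的"}

tokens = fmm_segment("中文文本的自动分类系统", dictionary)
filtered = [t for t in tokens if t not in stop_words]
print(tokens)
print(filtered)
```

Because the match is greedy and left to right, the result depends heavily on dictionary coverage and on `max_len`; a real system would load a much larger dictionary and a stop-word list tuned by experiment, as the abstract describes.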
The evaluation component assesses the classification results and feeds them back to the training module to improve the training process. Experiments yield empirical values for the relevant parameters and identify the better-performing combinations of these algorithms. The experimental data can be used in information retrieval, information filtering, digital libraries, Web page classification, and similar applications.
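Of the term selection measures named in the abstract, the χ² statistic is easy to illustrate on a toy corpus. The sketch below uses the standard two-by-two contingency form, χ²(t, c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), where A and B count documents containing the term inside and outside the category, and C and D count documents lacking the term inside and outside the category. The abstract does not say which variant the thesis uses, so this formulation, the function name `chi_square`, and the sample documents are assumptions for illustration.

```python
def chi_square(term, category, docs):
    """Chi-square score of a term for a category.

    docs is a list of (set_of_terms, label) pairs (an assumed data layout).
    """
    a = b = c = d = 0
    for terms, label in docs:
        has_term = term in terms
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document outside category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document outside category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # score is undefined when a marginal count is zero
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical, already segmented toy corpus.
docs = [
    ({"足球", "比赛"}, "sports"),
    ({"足球", "球员"}, "sports"),
    ({"股票", "市场"}, "finance"),
    ({"股票", "比赛"}, "finance"),
]

score_football = chi_square("足球", "sports", docs)
score_match = chi_square("比赛", "sports", docs)
print(score_football, score_match)
```

A term that appears only in one category scores high, while a term spread evenly across categories scores zero; keeping the top-scoring terms per category is one common way to reduce the feature space before weighting and classification.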
Keywords/Search Tags: Chinese Text Classification, Word Segmentation, Terms Selection, Weight Computation