
Design and Implementation of a Chinese Text Categorization System Based on Support Vector Machine

Posted on: 2013-09-11  Degree: Master  Type: Thesis
Country: China  Candidate: L Xue  Full Text: PDF
GTID: 2248330362474780  Subject: Computer technology
Abstract/Summary:
Information on the Internet is growing explosively, and a large share of it is in the form of text. In addition, many text documents are produced and retained by the information systems of organizations such as enterprises, research institutions, and universities. Managing and using this text information has therefore become a great challenge for users. Fortunately, technologies for text processing have been proposed and developed. Among them, text classification extracts the category information of a text and serves as a foundation for other text processing tasks, so research on text classification is becoming increasingly significant in the text processing area.

This paper proposes and develops a text classification system based on SVM (Support Vector Machine). The process is divided into three parts: training, classification, and results. The training part constructs the classifier from training documents, the classification part performs the key work of classifying texts, and the final part displays and evaluates the classification results.

The modules of this text classification system are as follows:
1. Text Pre-processing Module: it handles Chinese word segmentation and stop word removal. In this paper, the ICTCLAS tool from the Chinese Academy of Sciences is used to pre-process the text.
2. Feature Selection Module: it implements five feature selection methods, namely Information Gain, Mutual Information, Cross Entropy, Chi-square, and the Weight of Evidence for Text.
3. Weight Computation Module: it computes term weights by the TF*IDF and TF*IDF*IG methods.
4. Text Representation Module: it represents the text in the vector space model.
5. Classifier Construction Module: it supports linear, polynomial, RBF, and sigmoid kernel functions, and trains the classifier in a one-versus-rest (one-to-many) way.
6. Classification Module: it classifies the text by using the trained classifier.
7. Results Module: it gives the classification results together with some performance evaluation.

The experiments are carried out on the Sogou corpus and compare the results obtained with different probability estimates, kernel functions, feature selection methods, and weight computing methods. An analysis based on the experimental results is then given.
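As a rough illustration of the feature selection and weight computation steps listed above, the following is a minimal Python sketch and not the thesis implementation. It assumes the documents are already segmented into words (the thesis uses ICTCLAS for this), scores terms with the chi-square statistic over a 2x2 contingency table, and weights the selected terms with TF*IDF using the common idf = log(N/df) variant and L2 normalisation. The toy corpus, the function names, and these formula choices are all assumptions made for the example. SVM training on the resulting vectors is left out.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (segmented words, category label).
# The tokens are invented for illustration; they are not from the Sogou corpus.
DOCS = [
    (["football", "match", "goal"], "sports"),
    (["goal", "team", "match"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "market", "match"], "finance"),
]


def chi_square(term, category, docs):
    """Chi-square statistic of a term for one category (2x2 contingency table)."""
    a = sum(1 for ws, lab in docs if term in ws and lab == category)
    b = sum(1 for ws, lab in docs if term in ws and lab != category)
    c = sum(1 for ws, lab in docs if term not in ws and lab == category)
    d = sum(1 for ws, lab in docs if term not in ws and lab != category)
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, k):
    """Keep the k terms with the highest chi-square score (max over categories)."""
    vocab = sorted({w for ws, _ in docs for w in ws})
    cats = sorted({lab for _, lab in docs})
    score = {t: max(chi_square(t, c, docs) for c in cats) for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-score[t], t))[:k]


def tfidf_vector(words, features, docs):
    """TF*IDF weights over the selected features, L2-normalised."""
    tf = Counter(words)
    n = len(docs)
    vec = []
    for t in features:
        df = sum(1 for ws, _ in docs if t in ws)
        idf = math.log(n / df) if df else 0.0  # assumed idf variant
        vec.append(tf[t] * idf)
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


features = select_features(DOCS, 3)
vector = tfidf_vector(["goal", "match", "goal"], features, DOCS)
print(features, vector)
```

Vectors built this way are the kind of input a vector space model classifier such as an SVM would be trained on. In the toy corpus, "match" appears in both categories and so receives a lower chi-square score than terms that occur in only one category.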
Keywords/Search Tags: Text classification, Support vector machine, Text representation, Feature selection, Weight computing