Research And Implementation On The Related Algorithms Of Chinese Text Classification Based On SVM

Posted on: 2009-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: P Chen
Full Text: PDF
GTID: 2178360242488695
Subject: Computer application technology
Abstract/Summary:
Text classification organizes information according to the structure, content, and other properties of text, helping people pick out the information they need. The Support Vector Machine (SVM) is a hot spot in the fields of machine learning and pattern recognition, and it has recently been widely used in text classification. Taking SVM theory as its foundation, this thesis studies the algorithms related to Chinese text classification, and designs and implements a Chinese text classification system using these algorithms. The work includes:

(1) Preprocessing. FMM, MM, and an improved word segmentation algorithm are implemented. The improved algorithm replaces the traditional plain-text vocabulary with a dictionary structure that uses a first-character index and a second-level hash. Its matching rule effectively handles the problems of ambiguity and unknown words. A coding strategy is then added to the stop-word matching process to eliminate stop words.

(2) Feature selection. Four algorithms are implemented: Mutual Information (MI), Document Frequency (DF), Information Gain (IG), and χ² (CHI). Three factors through which a feature influences classification precision are expressed as formulas and then unified with MI, yielding an improved MI-based feature selection algorithm. This algorithm retains the computational simplicity of the original MI and is also advantageous in selecting strongly associated words.

(3) Classification module construction. The standard SVM is extended to multi-class SVM (M-SVMs) to fit multi-class problems; an SVM-based incremental learning method is proposed for classifying dynamic samples; and a construction method is proposed for an improved AdaBoost combined learning algorithm based on SVM, which uses rule sampling and is advantageous for samples whose distribution is unbalanced.

In addition, the algorithms in each module of the system are compared and evaluated through experiments.
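The preprocessing step in item (1) can be illustrated with a minimal sketch of forward maximum matching over a dictionary indexed by first character. This is not the thesis's implementation: the names `build_index` and `fmm_segment`, the `max_len` limit, and the toy vocabulary are all assumptions made for illustration, and a Python `set` per first character stands in for the second-level hash.

```python
from collections import defaultdict


def build_index(words):
    """Group dictionary words by first character (hypothetical helper).

    Each bucket is a set, so membership tests are hash lookups.
    """
    index = defaultdict(set)
    for word in words:
        if word:
            index[word[0]].add(word)
    return index


def fmm_segment(text, index, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word starting there, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        bucket = index.get(text[i], ())  # .get avoids creating empty buckets
        match = text[i]                  # fallback: one character
        # Try candidate lengths from longest to shortest (down to 2).
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in bucket:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Toy vocabulary, assumed for illustration only.
vocab = ["研究", "研究生", "生命", "起源"]
index = build_index(vocab)
tokens = fmm_segment("研究生命起源", index)
print(tokens)  # ['研究生', '命', '起源']
```

The output shows the kind of ambiguity that plain forward matching produces: the greedy match on 研究生 leaves 命 stranded, where 研究 / 生命 / 起源 would be the intended reading. The improved matching rule described in the abstract targets cases like this; its details are not given here, so the sketch does not attempt to reproduce it.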
Keywords/Search Tags: Text Classification, SVM, M-SVMs, Incremental Learning, Combined Learning