A Study On Chinese Text Automatic Categorization

Posted on: 2003-07-15  Degree: Master  Type: Thesis
Country: China  Candidate: L H Sun  Full Text: PDF
GTID: 2168360092966493  Subject: Computer application technology
Abstract/Summary:
The thesis systematically summarizes techniques for word segmentation, feature selection, categorization algorithms, and performance estimation in Chinese text categorization. It discusses three Chinese text categorization methods: the Bayes method, k Nearest Neighbor (kNN), and Support Vector Machines (SVM). The author develops three text classifiers: a naive Bayes classifier, a k Nearest Neighbor classifier, and an SVM classifier. A text categorization system incorporating the three classifiers is also built, and it is highly practical. Because of weaknesses that appear in kNN prediction, an improved kNN with more satisfactory performance is put forward.

The thesis places its emphasis on SVM. Based on Statistical Learning Theory (SLT), it discusses the SVM problem in the linearly separable, linearly non-separable, and non-linearly separable cases, and derives a convex quadratic programming (QP) problem with one equality constraint and a set of inequality constraints. A procedure for solving the QP problem is then proposed. For learning large-scale text corpora with SVM, it is important that the decomposition method optimizes the SVM with respect to subsets and recursively solves the whole SVM. The ξα estimator, based on the Leave-One-Out test, can efficiently and effectively estimate error rate, precision, recall, and F1. The thesis demonstrates the effectiveness of the decomposition method, and five measures for reducing learning time are adopted.

In addition, a multi-class SVM classifier based on the one-against-rest mode is developed. Because some texts are not distinguished by this classifier, the kNN method and a feature matching algorithm are used to post-classify them. This processing improves classifier performance considerably.

The experimental results show clearly that the three classifiers are suitable for Chinese text categorization, and that the SVM classifier is more satisfactory than the others. With post-classification by means of kNN and feature matching, the multi-class SVM gives the most satisfactory performance.
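
The post-classification step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not define when a text counts as "not distinguished", so the sketch assumes it means that zero or more than one of the one-against-rest SVMs returns a positive score. The function names (`cosine`, `knn_label`, `classify`), the sparse term-weight dictionaries, and the cosine similarity measure are all hypothetical choices made for illustration, and the feature matching algorithm is omitted.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_label(doc, train, k=3):
    """Majority label among the k training texts most similar to doc.

    train is a list of (term-weight dict, label) pairs.
    """
    ranked = sorted(train, key=lambda item: cosine(doc, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def classify(doc, svm_scores, train, k=3):
    """One-against-rest decision with a kNN fallback.

    svm_scores maps each class to the decision value of its binary SVM.
    Assumption: a text is "not distinguished" when zero or several
    binary SVMs give a positive score; only then is kNN consulted.
    """
    positive = [label for label, score in svm_scores.items() if score > 0]
    if len(positive) == 1:
        return positive[0]
    return knn_label(doc, train, k)
```

For example, a text whose scores are positive for exactly one class keeps that class, while a text rejected by every binary SVM, or claimed by several, receives the majority label of its k nearest training texts.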
Keywords/Search Tags: Chinese Information Processing, Chinese Text Automatic Categorization, Bayes Text Classifier, k Nearest Neighbor Classifier, Support Vector Machines