
Research On Uyghur Text Classification And System Development

Posted on: 2013-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: H M T J A B L T Ai
GTID: 2218330374966405
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer and network technology, the Internet has come into wide use. The rapid growth of Web information poses a challenge for information retrieval and makes it harder to find the information one needs. Text classification plays a key role in organizing cluttered information and has important applications in areas such as information retrieval, search engines, and digital library management.
Starting from the characteristics and writing rules of Uyghur, this thesis builds a relatively large text corpus of 20 categories with 300 texts in each category. Based on an in-depth study of the characteristics and grammatical rules of Uyghur, and through a large number of experiments and manual review, a relatively complete stop word list is established. The influence of stem extraction on the precision and speed of text classification is analyzed. Reducing the dimension of the vector space is one of the most important problems in text classification. By using Uyghur lexical rules and a stem extraction method, this thesis achieves good dimensionality reduction without affecting the classification accuracy of Uyghur text: the dimension reduction rate reaches 25% after stem extraction is applied.
The CHI statistic is used for feature selection. Experimental results show that selecting 3%-5% of the original features gives relatively the best feature set. The effect of Uyghur spelling errors on Uyghur text classification is analyzed through a large number of experiments. The results show that spelling errors have little effect on the classification of Uyghur text, but they do have a certain impact on reducing the dimension of the vector space.
The KNN, Naive Bayes (NB), and SVM classification algorithms, which are widely used both in China and abroad, are studied in depth. Uyghur text is classified with these algorithms, and the performance of each algorithm on Uyghur text is analyzed. Finally, by combining the characteristics of Uyghur with text classification techniques, a Uyghur text classification experiment platform (the Uyghur text classification system) is built.
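The abstract does not give the form of the CHI statistic or the selection procedure used in the thesis. As an illustration only, the following minimal Python sketch uses the common document-frequency form of the chi-square statistic, chi2(t, c) = N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), and keeps a top fraction of terms. The function names, the choice of taking the maximum score over categories, and the assumption that texts are already tokenized and stemmed are all hypothetical and are not taken from the thesis.

```python
from collections import Counter
import math


def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square value over all categories.

    docs   -- list of token lists (assumed already segmented and stemmed)
    labels -- list of category labels, one per document
    """
    n = len(docs)
    class_count = Counter(labels)      # documents per category
    term_count = Counter()             # documents containing each term
    term_class_count = Counter()       # documents of a category containing a term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):       # count each term once per document
            term_count[term] += 1
            term_class_count[(term, label)] += 1

    scores = {}
    for term, df in term_count.items():
        best = 0.0
        for category, nc in class_count.items():
            a = term_class_count[(term, category)]  # in category, has term
            b = df - a                              # other categories, has term
            c = nc - a                              # in category, lacks term
            d = n - a - b - c                       # other categories, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:                               # skip degenerate tables
                best = max(best, n * (a * d - c * b) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_fraction(scores, fraction=0.05):
    """Return the highest-scoring terms, keeping the given fraction (at least one)."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties broken by term
    return ranked[:k]
```

With a fraction between 0.03 and 0.05, `select_top_fraction` corresponds to the 3%-5% range reported above. Taking the maximum over categories is one common convention; a class-weighted average is another, and the thesis may have used either.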
Keywords/Search Tags: Uyghur, text classification, stem extraction, feature dimension reduction, classifier