Research On Uyghur Text Classification And System Development

Posted on:2013-01-12

Degree:Master

Type:Thesis

Country:China

Candidate:H M T J A B L T Ai

Full Text:PDF

GTID:2218330374966405

Subject:Computer application technology

Abstract/Summary:

With the rapid development of computer and network technology, the Internet has come into wide use. The rapid growth of Web information poses a challenge for information retrieval and makes it harder to find the information one needs. Text classification plays a key role in organizing cluttered information and has important applications in areas such as information retrieval, search engines, and digital library management.

Starting from the characteristics and writing rules of Uyghur, this paper builds a relatively large text corpus (20 categories, 300 texts per category). Based on an in-depth study of the characteristics and grammar rules of Uyghur, and through a large number of experiments and manual review, a relatively complete stop word list is established. The influence of stem extraction on the precision and speed of text classification is analyzed. Reducing the dimension of the vector space is one of the most important problems in text classification; by using Uyghur lexical rules and a stem extraction method, this paper achieves good dimensionality reduction without affecting the classification accuracy of Uyghur text. The dimension reduction rate reached 25% after applying stem extraction.

The CHI statistic is used as the feature selection method. Experimental results show that selecting 3%-5% of the original features gives, relatively speaking, the best feature set. The effect of Uyghur spelling errors on Uyghur text classification is analyzed through a large number of experiments. The results show that spelling errors have little effect on the classification of Uyghur text, but they do have some impact on reducing the dimension of the vector space.

The KNN, Naive Bayes (NB), and SVM classification algorithms, which are widely used in China and abroad, are studied in depth. Uyghur text is classified with each of these algorithms, and the performance of each algorithm on Uyghur text is analyzed. Finally, by combining Uyghur characteristics with text classification techniques, a Uyghur text classification experiment platform (the Uyghur text classification system) is built.
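
The CHI-based feature selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the `keep_ratio` parameter, the choice to score each term by its maximum CHI value over categories, and the toy English tokens are all assumptions made for illustration. Documents are assumed to be already tokenized (and, in the thesis's setting, stemmed and stop-word filtered).

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """CHI statistic for one (term, category) pair from a 2x2 table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term or category carries no contrast.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, keep_ratio=0.05):
    """Keep the top fraction of terms, ranked by max CHI over categories.

    keep_ratio is a hypothetical parameter; the abstract reports that
    3%-5% of the original features worked best on its corpus.
    """
    n_docs = len(docs)
    cat_count = Counter(labels)
    df = Counter()                 # document frequency per term
    df_cat = defaultdict(Counter)  # document frequency per (category, term)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[label][term] += 1

    scores = {}
    for term, total in df.items():
        best = 0.0
        for cat, n_cat in cat_count.items():
            a = df_cat[cat][term]
            b = total - a
            c = n_cat - a
            d = n_docs - n_cat - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best

    k = max(1, int(len(scores) * keep_ratio))
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy data (placeholder tokens, not from the thesis corpus).
docs = [
    ["sport", "ball", "the"],
    ["sport", "goal", "the"],
    ["bank", "money", "the"],
    ["bank", "loan", "the"],
]
labels = ["sports", "sports", "finance", "finance"]
selected = select_features(docs, labels, keep_ratio=0.3)
```

In this toy example, a term that appears in every document (`"the"`) scores zero and is dropped, while terms that occur only in one category rank highest. Averaging CHI over categories instead of taking the maximum is an equally common choice.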

Keywords/Search Tags:

Uyghur, text classification, stem extraction, feature dimension reduction, classifier

Related items

1	Based On The Stem Of The Uyghur Language Text Cluster Research And Implementation
2	Uyghur Text Clustering System Design And Implementation Based On Python
3	Study On Key Techniques Of Uyghur Character Recognition
4	Research On On-Line Uyghur Character Recognition Technology Based On Features Combination
5	Automatic Extraction Of Uyghur Ontology Concept Classification Relationship Based On Seed Bootstrap
6	Design And Implementation Of Uyghur Text Classifier Based On Generalized Information Entropy
7	Research Of Chinese Text Classification Based On Naive Bayesian Method And Application Of Microblogging Data Classification
8	Study On The Text Classification Feature Selection Method-the Uyghur Language
9	Learning-Based Text Extraction In Natural Background
10	Research On Feature Description And Classifier Construction Algorithm In Chinese Text Classification