Research And System Implementation Of Uyghur Text Classification Based On N-gram

Posted on: 2015-05-06
Degree: Master
Type: Thesis
Country: China
Candidate: E N S G L S M T Tu
GTID: 2298330431491892
Subject: Computer application technology
Abstract/Summary:
Feature extraction is an important step in text classification; characters, words, or phrases can be chosen as features. When words are used as features, the extraction process requires a segmentation tool, a stemming tool, part-of-speech tagging, a semantic analyzer, electronic dictionaries, a spell checker, a stop-word list, a standard text corpus, and other related tools and resources. However, Uyghur information processing technology is still at the stage of being improved and consolidated, and tools and resources published on the Internet are rare. Because Uyghur is an agglutinative language in which many affixes are attached to words, its word morphology is very rich, so spelling mistakes and grammatical errors are hard to avoid. Considering this situation, this paper designs a Uyghur text classification system based on N-gram. The distinguishing characteristic of this system is that it does not need stemming, part-of-speech tagging, or other natural language tools, and the effect of spelling errors on text classification is reduced to a minimum. In the process of feature extraction, the paper first discusses the character-level N-gram model. Second, the problem of selecting the parameter N in the Uyghur N-gram model is studied in depth. For feature selection, an N-gram frequency statistics method that reflects context information is adopted, and an N-gram feature library is built for each text category from the collected training text set. Classification experiments are carried out on the test corpus using the Manhattan and Dice distance similarity methods. When the N-gram model parameter N is held fixed, classification performance improves as the number of features increases, but declines once the number of features exceeds 400.
The experimental results show that with 5-grams and 400 features, the Manhattan similarity method achieves the best classification performance, while 2-grams give the worst. Finally, combining the characteristics of the Uyghur language with the text categorization method based on N-gram frequency statistics, a Uyghur text classification experiment platform (the Uyghur text classification system based on N-gram) is designed and implemented.
Keywords/Search Tags:Uyghur, Text classification, N-gram language model, N-gram profile, Similarity distance
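
The abstract does not give the exact formulas used, so the following is only a minimal sketch of how character-level N-gram profiles and the two distance measures could fit together. The function names, the normalization by total N-gram count, and the use of a set-based Dice coefficient are all assumptions made for illustration, not the thesis's implementation. The default values n=5 and k=400 simply mirror the best configuration reported in the abstract.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(texts, n=5, k=400):
    """Map the k most frequent n-grams in texts to their relative frequency.

    Assumption: frequencies are normalized by the total n-gram count.
    """
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(k)}


def manhattan(p, q):
    """Sum of absolute frequency differences over the union of n-grams."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def dice_distance(p, q):
    """1 minus the Dice coefficient of the two profiles' n-gram sets.

    Assumption: Dice is computed on sets, ignoring frequencies.
    """
    if not p and not q:
        return 0.0
    shared = len(set(p) & set(q))
    return 1.0 - 2.0 * shared / (len(p) + len(q))


def classify(text, profiles, n=5, k=400, distance=manhattan):
    """Return the label whose category profile is closest to the text."""
    doc = build_profile([text], n, k)
    return min(profiles, key=lambda label: distance(doc, profiles[label]))
```

To use it, one would build one profile per category from the training texts, for example `profiles = {label: build_profile(texts) for label, texts in training.items()}` with a hypothetical `training` dictionary, and then call `classify(new_text, profiles)`. Because the features are raw character sequences, no stemmer or tagger is involved, which is the property the abstract emphasizes.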