Research And System Implementation Of Uyghur Text Classification Based On N-gram

Posted on:2015-05-06

Degree:Master

Type:Thesis

Country:China

Candidate:E N S G L S M T Tu

Full Text:PDF

GTID:2298330431491892

Subject:Computer application technology

Abstract/Summary:

Feature extraction is an important step in text classification, and characters, words, or phrases can be chosen as features. When words are used as features, the extraction process requires a segmentation tool, a stemming tool, part-of-speech tagging, a semantic analyzer, electronic dictionaries, a spell-checking tool, a stop list, a standard text corpus, and other related tools and resources. Uyghur information processing technology, however, is still at the stage of improvement and consolidation, and such tools and resources are rarely published on the Internet. Because Uyghur is an agglutinative language in which many affixes attach to words and the morphological changes of words are very rich, spelling mistakes and grammatical errors are hard to avoid. Considering this situation, this thesis designs a Uyghur text classification system based on N-grams. Its distinguishing characteristic is that it needs no stemming, part-of-speech tagging, or other natural language tools, and the effect of spelling errors on text classification is reduced to a minimum.

For feature extraction, the thesis first discusses the character-level N-gram model and then studies in depth the problem of selecting the parameter N for the Uyghur N-gram model. For feature selection, an N-gram frequency statistics method that takes context information into account is adopted, and an N-gram feature library is built for each text category from the collected training text set. Classification experiments were carried out on the test corpus using the Manhattan distance and Dice similarity measures. With the N-gram parameter N held fixed, classification performance improves as the number of features increases, but declines once the number of features goes beyond 400. The experimental results show that 5-grams with 400 features and the Manhattan measure give the best classification performance, while 2-grams give the worst.

Finally, combining the characteristics of the Uyghur language with the text categorization method based on N-gram frequency statistics, a Uyghur text classification experiment platform (the Uyghur text classification system based on N-gram) was designed and implemented.
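The profile-based approach outlined in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state how frequencies are normalized or how the Dice measure is computed, so the relative-frequency profiles, the set-based Dice coefficient, and all function names (`char_ngrams`, `build_profile`, `manhattan_distance`, `dice_similarity`, `classify`) are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(texts, n, k):
    """Map the k most frequent character n-grams to their relative frequency.

    Assumption: frequencies are normalized by the total n-gram count.
    """
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(k)}


def manhattan_distance(p, q):
    """Sum of absolute frequency differences over the union of n-grams."""
    keys = set(p) | set(q)
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in keys)


def dice_similarity(p, q):
    """Set-based Dice coefficient on the n-grams present in each profile."""
    if not p and not q:
        return 0.0
    shared = len(set(p) & set(q))
    return 2.0 * shared / (len(p) + len(q))


def classify(text, category_profiles, n, k):
    """Pick the category whose profile has the smallest Manhattan distance."""
    doc = build_profile([text], n, k)
    return min(
        category_profiles,
        key=lambda c: manhattan_distance(doc, category_profiles[c]),
    )
```

With the settings the abstract reports as best, one would call `build_profile(training_texts, n=5, k=400)` once per category and pass the resulting dictionary of profiles to `classify`. Because the n-grams are taken directly over characters, no stemmer or tagger is involved, and a misspelled word changes only the few n-grams that overlap the error.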

Keywords/Search Tags:

Uyghur, Text classification, N-gram language model, N-gram profile, Similarity distance

Related items

1	Research On Uyghur Full-text Information Retrieval Based On N-gram Characters Model
2	Research On Short Text Emotion Classification Method Based On Word2Vec And N-Gram
3	Language Independent Text Categorization
4	Research On Language Independent Text Categorization
5	Language-independent text learning with statistical n-gram language models
6	Research Of Uyghur N-gram Model And Smoothing Algorithm
7	Research On N-gram Based Hierarchical Text Language Identification
8	Research On Language Identification Of Social Media Short Text Based On N-Gram Vector Feature
9	Research On The Contextual Cohesion Of Social Media Texts For News
10	Researching And Building Of The Mongolian Large Vocabulary Independent Continuous Speech Recognition System