The Key Issues In Uyghur Natural Language Processing

Posted on:2016-03-14

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L D T E X Pa

Full Text:PDF

GTID:1318330482977458

Subject:Computer application technology

Abstract/Summary:

With the growth of computer technology, and especially the popularization and rapid development of the Internet, natural language processing (NLP) is receiving increasing attention from researchers in computer science. Meanwhile, research on information processing for minority ethnic groups is also emerging and advancing steadily. The central government's "One Belt and One Road" strategy gives the development of information processing technology for minority groups greater significance: on the one hand it provides a valuable opportunity, and on the other hand it poses a great challenge for research in this area.

Taking the multi-model, multi-encoding and agglutinative properties of Uyghur characters into consideration, this dissertation proposes a code transformation technique between the Arabic-based and Cyrillic-based (Slavic) scripts of the Uyghur language. More specific NLP techniques, including part-of-speech (POS) tagging and word-stem extraction, are then elaborated. Finally, on this script code transformation platform, the results of comparative experiments using three classification algorithms are analyzed and some conclusions are drawn. The main research contents are as follows:

1. A code transformation method that combines rules with a dictionary is presented. Binary search is used because of the binary structure of the documents in the corpus. The code transformation system was developed on a Microsoft middleware platform, realizing automatic code transformation.

2. A maximum entropy based POS tagging model for Uyghur that takes stem-affix harmony into account is proposed; a POS tagging feature template is established and its corresponding feature functions are designed. The experimental analysis demonstrates that the maximum entropy based model handles the POS tagging of ambiguous and out-of-vocabulary (unlisted) Uyghur words acceptably, and that it clearly outperforms the other POS taggers for Uyghur words it was compared with.

3. A multi-strategy integrated Uyghur word-stem extraction method is proposed. The word-stem segmentation approach combines rules and a dictionary, maximum entropy, and a finite-state automaton. Experiments on an authoritative corpus show that the proposed approach improves word-stem extraction accuracy for nouns.

4. Uyghur text classification technology is introduced, and a text corpus for testing classification performance is established. Stem extraction and chi-square (CHI) statistical feature selection contribute substantially to dimensionality reduction. On a substantial volume of text, the performance of the KNN, NB and SVM classification algorithms on Uyghur texts is analyzed. The comparative experiments show that SVM provides the best performance of the three algorithms.

To sum up, this dissertation analyzes and studies Uyghur natural language processing techniques, including character code transformation, POS tagging and stem extraction, and their effects on text classification. The comparative experiments yield some valuable results and provide a reference for later research.
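As an illustration of the CHI statistical feature selection mentioned in item 4, the following is a minimal sketch of the standard chi-square term-scoring formula applied to stemmed documents. It is not the dissertation's implementation: the function names `chi_square_scores` and `select_top_k`, the class labels, and the Latin-script placeholder stems are all assumptions made for the example.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term by its chi-square association with class `target`.

    docs   : list of token lists (assumed to be already stemmed)
    labels : list of class labels, one per document
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df_all = Counter()     # number of documents containing the term
    df_target = Counter()  # number of target-class documents containing it
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            df_all[term] += 1
            if y == target:
                df_target[term] += 1

    scores = {}
    for term, df in df_all.items():
        a = df_target[term]       # in class, contains term
        b = df - a                # not in class, contains term
        c = n_target - a          # in class, lacks term
        d = n - n_target - b      # not in class, lacks term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy data: placeholder stems, not taken from the thesis corpus.
docs = [["top", "oyun"], ["top", "mukapat"], ["bazar", "baha"], ["bazar", "oyun"]]
labels = ["sport", "sport", "economy", "economy"]

scores = chi_square_scores(docs, labels, "sport")
top = select_top_k(scores, 2)
```

Because the statistic is symmetric, a term that appears only outside the target class scores as highly as one that appears only inside it; both are informative for a classifier. Keeping only the top-scoring stems is what reduces the feature dimension before a KNN, NB or SVM classifier is trained.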

Keywords/Search Tags:

Uyghur information processing, part-of-speech tagging, code conversion, stemming, text classification

Related items

1	The Research Of Uyghur Stemming Based On Morfessor And POS Tagging
2	Research On Text Classification Method Based On Part Of Speech Tagging LDA Model
3	Research On Text Document Information Hiding
4	Study Of Kazak Part-of-Speech Tagging Based Upon HMM
5	An Analysis Of Kazak's Lexical Method Based On Web Corpus
6	Chinese Word Found Its Part Of Speech Tagging
7	Fast KNN Text Categorization Method Based On Improved Hash Algorithm
8	Causal Relation Extraction Of Uyghur Events
9	Research On Part Of Speech Tagging System Of Pre-Qin Classics Oriented To Entity Extraction
10	Research And Implementation Of Modify Chinese Part-of-Speech Tagging Based On FST Technology