The Key Issues In Uyghur Natural Language Processing

Posted on: 2016-03-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L D T E X Pa
Full Text: PDF
GTID: 1318330482977458
Subject: Computer application technology
Abstract/Summary:
With the growth of computer technology, and especially the rapid spread and development of the Internet, natural language processing (NLP) has become increasingly important to computer science researchers. Information processing research for minority ethnic groups is also emerging and advancing steadily. The central government's "One Belt and One Road" strategy adds significance to the development of information processing technology for minority groups: it offers a valuable opportunity and, at the same time, poses a great challenge for research in this area.

Taking into account the multiple written forms, multiple encodings, and agglutinative nature of Uyghur, this dissertation proposes a code transformation technique between the Arabic-based and Slavic (Cyrillic)-based scripts of the Uyghur language. It then presents more specific studies of NLP techniques, including part-of-speech (POS) tagging and word-stem extraction. Finally, on this script code transformation platform, comparative experiments with three classification algorithms are analyzed and conclusions are drawn. The main research contents are as follows:

1. A code transformation method that combines rules with a dictionary is presented. Binary search is used because of the binary structure of the documents in the corpus. The method was implemented on a Microsoft middleware platform, yielding an automatic code transformation system.

2. A maximum entropy based POS tagging model for Uyghur harmonious stems and affixes is proposed, together with a POS tagging feature template and its corresponding feature functions. Experimental analysis shows that the maximum entropy based model handles ambiguous words and unlisted (out-of-vocabulary) words acceptably, and that it clearly outperforms the other POS taggers for Uyghur words that were compared.

3. A multi-strategy integrated method for Uyghur word-stem extraction is proposed. The stem segmentation approach combines rules and a dictionary, maximum entropy, and finite-state automata. Experiments on an authoritative corpus show that the proposed approach improves stem extraction accuracy for nouns.

4. Uyghur text classification technology is introduced, and a text corpus for testing classification performance is established. Stem extraction and CHI statistical feature selection contribute substantially to dimensionality reduction. On a large text corpus, the performance of the KNN, NB, and SVM classification algorithms on Uyghur texts is analyzed. The comparative experiments show that SVM performs best among the three algorithms.

To sum up, this dissertation analyzes and studies Uyghur natural language processing techniques, including character code transformation, POS tagging, and stem extraction, and their effects on text classification. The comparative experiments yield some valuable results and provide a reference for later research.
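The CHI statistical feature selection mentioned in item 4 can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption: the function names `chi_square` and `select_features` are hypothetical, the documents are assumed to be lists of already-extracted stems, and a term's score is taken as its maximum CHI value over all classes (one common convention, not necessarily the one used in the dissertation).

```python
from collections import Counter


def chi_square(a, b, c, d):
    """CHI statistic for one (term, class) pair from a 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Rank terms by their maximum CHI score over all classes; keep the top k.

    docs: list of documents, each a list of stems (assumed pre-stemmed)
    labels: class label of each document
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()        # term -> number of documents containing it
    class_doc_freq = Counter()  # (term, label) -> same count within one class
    for stems, label in zip(docs, labels):
        for term in set(stems):
            doc_freq[term] += 1
            class_doc_freq[(term, label)] += 1

    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for label, size in class_sizes.items():
            a = class_doc_freq[(term, label)]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Calling `select_features(docs, labels, k)` on a stemmed corpus returns the `k` terms most strongly associated with some class; only those terms would then be kept as dimensions of the feature vectors passed to a classifier such as KNN, NB, or SVM.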
Keywords: Uyghur information processing, part-of-speech tagging, code conversion, stemming, text classification