Automatic Morphological Analysis And Corpus Construction Of Modern Kazakh

Posted on: 2016-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
GTID: 2175330470964947
Subject: Chinese Ethnic Language and Literature
Abstract/Summary:
Kazakh belongs to the Turkic language group and is a typical agglutinative language. Within China, written Kazakh uses an Arabic-based alphabetic script consisting of 24 consonants, 9 vowels, and a soft sign, and its text is encoded in the international Unicode standard. Structurally, a word usually has three parts: the root, the stem (root plus affixes), and additional components (affixes and suffixes). Kazakh is a morphologically rich language with a large number of structural suffixes. Each suffix usually corresponds to one grammatical meaning, so several grammatical meanings are expressed by attaching several suffixes in the proper sequence. The regularity of Kazakh grammatical structure and the rules governing how endings are linked provide a theoretical basis for automatic morphological analysis. Corpus linguistics and natural language processing are complementary: work with large-scale corpora relies on processing natural language with statistical language models. For Kazakh, automatic morphological analysis is the prerequisite for constructing a corpus. The main task of Kazakh automatic morphological analysis is to perform lemmatization and part-of-speech tagging, which form the basis of Kazakh natural language processing. Stemming takes a given word and, through automatic morphological analysis, extracts the string that carries the word's original meaning and separates out each additional component that expresses a grammatical meaning.
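The stemming step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the stem lexicon, the suffix table, the tag names, the function name `analyze`, and the Latin transliteration are all assumptions made for illustration, and the sketch ignores suffix-ordering rules and irregular stem changes. It strips known suffixes from the right until the remainder is found in a stem lexicon, and returns every segmentation it finds, so ambiguous words yield more than one analysis and unknown words yield none.

```python
from typing import Dict, List, Tuple

# Toy, hypothetical data in Latin transliteration (not the thesis's lexicon).
STEMS: Dict[str, str] = {"bala": "N", "mektep": "N"}  # stem -> part of speech
SUFFIXES: Dict[str, str] = {                          # suffix -> grammatical tag
    "lar": "PL",
    "ter": "PL",
    "ga": "DAT",
    "de": "LOC",
}

Analysis = Tuple[str, str, List[str]]  # (stem, part of speech, suffix tags)


def analyze(word: str) -> List[Analysis]:
    """Return every (stem, pos, suffix_tags) segmentation of `word`."""
    results: List[Analysis] = []
    # Case 1: the whole string is a known stem with no suffixes left.
    if word in STEMS:
        results.append((word, STEMS[word], []))
    # Case 2: peel one suffix off the right edge and analyze the remainder.
    for suffix, tag in SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            for stem, pos, tags in analyze(word[: -len(suffix)]):
                # Tags are collected in left-to-right surface order.
                results.append((stem, pos, tags + [tag]))
    return results


if __name__ == "__main__":
    for w in ("balalarga", "mektepterde", "kitap"):
        print(w, analyze(w))
```

An empty result for a word such as `kitap` corresponds to the unknown-word problem, and a result with several entries corresponds to segmentation ambiguity; a fuller analyzer would also need rules for suffix order and for restoring irregularly changed stems.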
Lemmatization and part-of-speech tagging are an important step in lexical analysis for the natural language processing of agglutinative languages. In Kazakh, the main problems encountered are lemmatization ambiguity, unknown words, and the restoration of stems that have undergone irregular changes. The balanced corpus takes normativeness and accessibility as its basic principles and uses the Kazakh-language web resources of people.com.cn as its source. Through analysis based on automatic processing and automatic configuration, a web corpus of 207,000 words was annotated with stems and parts of speech, and a corpus was then constructed. Establishing a Kazakh corpus has direct practical value: it offers researchers of Kazakh a corpus-based approach and facilitates language teaching, dictionary compilation, machine translation, and related work.
Keywords/Search Tags:Kazakh, corpus, automatic morphological analysis, dynamic