An Analysis of Kazakh Lexical Methods Based on a Web Corpus

Posted on: 2016-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: H F Liu
Full Text: PDF
GTID: 2208330470466828
Subject: Computer software and theory
Abstract/Summary:
Word stemming and part-of-speech (POS) tagging are basic components of information extraction, text classification, information retrieval, speech recognition, and natural language processing in general. In Kazakh information processing, stemming and POS tagging are also the main tasks of Kazakh lexical analysis.

To support research on stemming and POS tagging, we built a Kazakh-Chinese dictionary of 49,613 words, extracted 32,562 Kazakh word stems, summarized 311 Kazakh word-formation rules, built a web corpus of 216,000 Kazakh words, and manually POS-tagged a corpus of 128,000 Kazakh words. We also developed a user-friendly interactive POS tagging system.

For word stemming, we propose a method that combines rules, dictionary lookup, and maximum matching. Following the characteristics of Kazakh word formation, the method takes into account vowel harmony, the POS of the stem, and the order in which suffixes attach to the stem. The accuracy reaches 95.6%, which is 3.26% higher than that of the method combining rules with lexical analysis. The results show that the proposed method is effective.

For POS tagging, we use HMMs (Hidden Markov Models) and CRFs (Conditional Random Fields) to tag Kazakh text. Building on the first-order HMM, we present a second-order HMM for Kazakh POS tagging; its accuracy reaches 79.2%, which is 0.83% higher than that of the first-order HMM. Because of the label bias problem of HMM, we use CRFs for comparison. After optimizing the CRF feature template for POS tagging, the final accuracy reaches 89.48%. The results show that CRFs perform better than HMMs in Kazakh POS tagging.
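The stemming approach described in the abstract (dictionary lookup, maximum matching, a vowel-harmony check, and a stem-POS check) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the stem dictionary, the suffix table, the Latin transliteration, the simplified harmony rule, and the function names are hypothetical and are not taken from the thesis, whose actual resources contain 32,562 stems and 311 word-formation rules.

```python
# Toy resources (hypothetical, Latin-transliterated; NOT the thesis's lexicon).
STEMS = {"bala": "noun", "kitap": "noun", "mektep": "noun"}

# Each suffix is mapped to the stem POS it is assumed to attach to.
SUFFIXES = {
    "lar": "noun", "ler": "noun", "tar": "noun", "ter": "noun",  # plural
    "da": "noun", "de": "noun", "ta": "noun", "te": "noun",      # locative
}

BACK_VOWELS = set("aou")   # simplified vowel classes for this sketch
FRONT_VOWELS = set("ei")


def harmony(text):
    """Return 'back' or 'front' for the last vowel in text, or None."""
    for ch in reversed(text):
        if ch in BACK_VOWELS:
            return "back"
        if ch in FRONT_VOWELS:
            return "front"
    return None


def harmonizes(stem, suffix):
    """A suffix is accepted only if its vowel class matches the stem's."""
    a, b = harmony(stem), harmony(suffix)
    return a is None or b is None or a == b


def segments(stem, pos, rest):
    """True if rest splits into known suffixes valid for this stem."""
    if not rest:
        return True
    # Try longer suffixes first, backtracking if a choice fails.
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if (rest.startswith(suf)
                and SUFFIXES[suf] == pos
                and harmonizes(stem, suf)
                and segments(stem, pos, rest[len(suf):])):
            return True
    return False


def stem_word(word):
    """Return the longest dictionary stem whose remainder is a valid
    suffix chain; return the word unchanged if none is found."""
    for cut in range(len(word), 0, -1):  # maximum matching: longest first
        stem, rest = word[:cut], word[cut:]
        if stem in STEMS and segments(stem, STEMS[stem], rest):
            return stem
    return word


if __name__ == "__main__":
    for w in ["balalar", "kitaptarda", "mektepter", "mekteptar"]:
        print(w, "->", stem_word(w))
```

In this sketch, a form such as "mekteptar" is returned unchanged because the back-vowel suffix does not harmonize with the front-vowel stem, which shows how the harmony check can reject a segmentation that dictionary lookup alone would accept.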
Keywords/Search Tags: Kazakh, Stemming, Part-Of-Speech Tagging, HMM, CRFs