An Analysis Of Kazak's Lexical Method Based On Web Corpus

Posted on:2016-04-15

Degree:Master

Type:Thesis

Country:China

Candidate:H F Liu

Full Text:PDF

GTID:2208330470466828

Subject:Computer software and theory

Abstract/Summary:

Word stemming and part-of-speech (POS) tagging are fundamental to information extraction, text classification, information retrieval, speech recognition, and other natural language processing tasks. In Kazakh information processing, they are also the main components of lexical analysis.

To support research on stemming and POS tagging, we built a Kazakh-Chinese dictionary of 49,613 words, extracted 32,562 Kazakh word stems, summarized 311 Kazakh word-formation rules, built a web corpus of 216,000 Kazakh words, and manually tagged a corpus of 128,000 Kazakh words with POS labels. We also developed a user-friendly interactive POS tagging system.

For word stemming, we propose a method that combines rules, dictionary lookup, and maximum matching. Following the particular way Kazakh words are formed, the method takes into account vowel harmony, the POS of the stem, and the order in which suffixes attach to the stem. Its accuracy reaches 95.6%, which is 3.26% higher than that of a method combining rules with lexical analysis, showing that the approach is effective.

For POS tagging, we use hidden Markov models (HMMs) and conditional random fields (CRFs). Building on the first-order HMM, we present a second-order HMM for Kazakh POS tagging; its accuracy reaches 79.2%, which is 0.83% higher than that of the first-order HMM. Because of the label bias problem of the HMM, we use CRFs for comparison. After optimizing the CRF POS tagging template, the final accuracy reaches 89.48%. The results show that CRFs perform better than HMMs in Kazakh POS tagging.
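As a rough illustration of how dictionary lookup and maximum matching can be combined for stemming, the following minimal Python sketch strips the longest matching suffix first and accepts a result only if it appears in a stem dictionary. It is not the thesis's implementation: the names `STEMS`, `SUFFIXES`, and `stem` are hypothetical, the transliterated words and suffixes are toy data chosen for illustration, and the vowel harmony, stem POS, and suffix-order constraints described above are omitted.

```python
# Toy stem dictionary and suffix list (illustrative assumptions, not the
# thesis's 32,562 stems or 311 word-formation rules).
STEMS = {"bala", "mektep", "kitap"}
SUFFIXES = ["lar", "ler", "tar", "ter", "ga", "ge", "da", "de"]


def stem(word, stems=STEMS, suffixes=SUFFIXES):
    """Return the dictionary stem of `word`, or None if no analysis is found.

    Suffixes are tried longest first (maximum matching) and stripped
    recursively from the end of the word until a known stem remains.
    """
    if word in stems:
        return word
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Only strip a suffix if something is left to analyze afterwards.
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = stem(word[: -len(suffix)], stems, suffixes)
            if candidate is not None:
                return candidate
    return None


if __name__ == "__main__":
    for w in ["balalarga", "mektepterde", "kitap", "xyz"]:
        print(w, "->", stem(w))
```

Requiring the final result to be a dictionary entry keeps the stripping from cutting into the stem itself; a fuller system would additionally reject suffix sequences that violate vowel harmony or the allowed suffix order.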

Keywords/Search Tags:

Kazakh, Stemming, Part-Of-Speech Tagging, HMM, CRFs

Related items

1	Research Kazakh Part Of Category Words Tagging
2	Study Of Kazak Part-of-Speech Tagging Based Upon HMM
3	The Development Of Part-of-speech Tagging Software For Kazakh Language
4	Research On Methods For Kazakh Lexical Analyzing And Phrase Parsing Based On Rules And Statistics
5	Research And Implementation Of Modify Chinese Part-of-Speech Tagging Based On FST Technology
6	Research On Lao Language Part-of-speech Tagging With Multiple Features
7	Research On Laodian Participle And Part-of-speech Tagging Method
8	Research On The Construction Method Of Burmese Part-of-speech Tagging Corpus
9	A Research On Lao Language Part-of-speech Tagging With Multi-feature Fusion
10	The Key Issues In Uyghur Natural Language Processing