Research On Kazakh Multi-category Words POS Tagging

Posted on:2015-01-27

Degree:Master

Type:Thesis

Country:China

Candidate:N N Niu

GTID:2298330431991887

Subject:Computer application technology

Abstract/Summary:

POS (Part-Of-Speech) tagging assigns the correct tag to each word in a text according to its context. It is an important part of natural language processing and the basis of machine translation, speech recognition, text categorization, information retrieval and many other applications. However, in automatic POS tagging, unknown words and multi-category words are two difficult problems that remain to be solved.

To address these problems, the paper combines the maximum entropy (ME) model, conditional random fields (CRFs) and association rules to build a POS tagging system for Kazakh multi-category words. The system takes full advantage of the rich contextual information of Kazakh to tag the POS of unknown words. Its implementation includes the following parts:
① An existing Kazakh tagging system is first used to preprocess the corpus. Some of the resulting files are then selected and manually proofread to serve as the training set, while the rest of the corpus is used as the test set.
② Based on the corpus, the paper uses the maximum entropy model to study Kazakh multi-category words. When a multi-category word has only one POS in the context of the corpus, the ME model alone is used to tag it; otherwise, the ME model is combined with a path searching method to obtain better tagging results for multi-category words.
③ CRFs are used to tag multi-category words. The feature template is chosen by an automatic feature template selection method, the corpus is converted into the format required by the CRF toolkit, the model is trained on the extracted feature template to obtain a CRF probability model, and the model is finally used to tag the corpus.
④ Association rules are extracted from the training corpus, and these rules are used to further optimize the corpus annotated by the CRFs.

The methods described above are applied to the Xinjiang Daily corpus. Experiments show that CRFs tag better than ME, and that the association rules-based approach further improves the CRF output, so that both Kazakh multi-category words and overall POS tagging achieve better results.
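The abstract does not say how the association rules are represented or applied, so the following is only a minimal sketch of one way such a post-correction step could look. It assumes, purely for illustration, rules of the form (previous tag, word) -> tag, kept when they meet a support and a confidence threshold. The function names `mine_rules` and `apply_rules`, the thresholds, the tag names and the placeholder tokens `w1`..`w3` are all hypothetical and are not taken from the thesis.

```python
from collections import Counter

START = "<S>"  # hypothetical marker for the sentence-initial context


def mine_rules(tagged_sentences, min_support=2, min_confidence=0.8):
    """Mine rules (previous_tag, word) -> tag from gold-tagged sentences.

    Assumed rule form, not the thesis's: a rule is kept when the triple
    occurs at least `min_support` times and accounts for at least
    `min_confidence` of all occurrences of its (previous_tag, word) context.
    """
    context_counts = Counter()
    rule_counts = Counter()
    for sentence in tagged_sentences:
        prev_tag = START
        for word, tag in sentence:
            context_counts[(prev_tag, word)] += 1
            rule_counts[(prev_tag, word, tag)] += 1
            prev_tag = tag

    rules = {}
    for (prev_tag, word, tag), count in rule_counts.items():
        confidence = count / context_counts[(prev_tag, word)]
        # With min_confidence > 0.5, at most one tag per context can pass.
        if count >= min_support and confidence >= min_confidence:
            rules[(prev_tag, word)] = tag
    return rules


def apply_rules(predicted_sentence, rules):
    """Overwrite a tagger's output wherever a mined rule matches."""
    corrected = []
    prev_tag = START
    for word, tag in predicted_sentence:
        new_tag = rules.get((prev_tag, word), tag)
        corrected.append((word, new_tag))
        prev_tag = new_tag  # later rules see the corrected context
    return corrected


# Toy gold data with placeholder tokens; "w2" is a multi-category word.
train = [
    [("w1", "ADJ"), ("w2", "N")],
    [("w1", "ADJ"), ("w2", "N")],
    [("w3", "N"), ("w2", "V")],
    [("w3", "N"), ("w2", "V")],
]
rules = mine_rules(train)

# Hypothetical tagger output in which "w2" received the wrong tag.
predicted = [("w1", "ADJ"), ("w2", "V")]
print(apply_rules(predicted, rules))
```

In this sketch the rules only override a prediction when the context was seen often enough and almost always with the same tag, which keeps the correction step conservative; a real system would tune both thresholds on held-out data.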

Keywords/Search Tags:

Kazakh, multi-category words, POS tagging, ME, CRFs, association rules

Related items

1	Research On Methods Of Chinese Word Classification And POS Tagging
2	Research On Chinese Parts Of Speech Tagging And POS Guessing Over Unknown Words
3	A Study Of The Shallow Syntactic Analysis Methods In Vietnamese
4	An Analysis Of Kazak 's Lexical Method Based On Web Corpus
5	Research On Methods For Kazakh Lexical Analyzing And Phrase Parsing Based On Rules And Statistics
6	Research On Extraction Methods Of Kazakh Common-used Words And Investigation Of Elementary School Textbooks' Words
7	Research Of Kazakh Hot Words Extraction Methods For Internet Public Sentiment
8	The Study Of Rule-based Chinese Words Tagging Method
9	The Development Of Part-of-speech Tagging Software For Kazakh Language
10	Chinese Multi-category Product Words Segmentation And Recognition Based On Electronic Commerce