
Research on Part-of-Speech Tagging of Kazakh Multi-Category Words

Posted on: 2015-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: N N Niu
Full Text: PDF
GTID: 2298330431991887
Subject: Computer application technology
Abstract/Summary:
Part-of-speech (POS) tagging assigns the correct tag to each word in a text according to its context. It is an important part of natural language processing and the basis of machine translation, speech recognition, text categorization, information retrieval and many other applications. In automatic POS tagging, however, unknown words and multi-category words (words that can take more than one part of speech) remain two difficult problems.

To address these problems, this thesis combines maximum entropy (ME), conditional random fields (CRFs) and association rules to build a POS tagging system for Kazakh multi-category words. The system also takes full advantage of the rich contextual information in Kazakh to tag unknown words. Its implementation consists of the following parts:

① An existing Kazakh tagging system is used to pre-tag the corpus. Some of the files are then selected and manually proofread to serve as the training set, while the rest of the corpus serves as the test set.
② Based on this corpus, the maximum entropy model is used to study Kazakh multi-category words. When a multi-category word has only one POS in the context of the corpus, the ME model alone is used to tag it; otherwise, the ME model is combined with a path search method to obtain a better tagging result.
③ CRFs are used to tag multi-category words. The feature template is chosen with an automatic feature template selection method, the corpus is converted into the format required by the CRF toolkit, the model is trained with the selected feature template to obtain a CRF probability model, and finally the corpus is tagged.
④ Association rules are extracted from the training corpus and used to further optimize the corpus annotated by CRFs.

The methods described above are applied to the Xinjiang Daily corpus. Experiments show that CRFs tag better than ME, and that the association-rule-based approach further improves the CRF output, so that tagging of both Kazakh multi-category words and words overall achieves better results.
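
The abstract does not say what form the association rules take or how they are applied to the CRF output. The following Python sketch illustrates one plausible reading only: rules of the shape (previous tag, word) -> tag are mined from a gold-tagged training corpus with support and confidence thresholds, then used to overwrite a tagger's output. The rule shape, the function names `mine_rules` and `apply_rules`, the `<s>` sentence-start marker and the threshold values are all assumptions made for illustration, not details taken from the thesis.

```python
from collections import Counter


def mine_rules(tagged_sentences, min_support=2, min_confidence=0.8):
    """Mine rules (previous_tag, word) -> tag from a gold-tagged corpus.

    The rule shape and both thresholds are illustrative assumptions.
    Keep min_confidence above 0.5 so that at most one tag can qualify
    for any given context.
    """
    context_counts = Counter()  # occurrences of (prev_tag, word)
    joint_counts = Counter()    # occurrences of (prev_tag, word, tag)
    for sentence in tagged_sentences:
        prev_tag = "<s>"  # assumed sentence-start marker
        for word, tag in sentence:
            context_counts[(prev_tag, word)] += 1
            joint_counts[(prev_tag, word, tag)] += 1
            prev_tag = tag

    rules = {}
    for (prev_tag, word, tag), count in joint_counts.items():
        confidence = count / context_counts[(prev_tag, word)]
        if count >= min_support and confidence >= min_confidence:
            rules[(prev_tag, word)] = tag
    return rules


def apply_rules(sentence, rules):
    """Overwrite a tagger's output wherever a mined rule fires."""
    corrected = []
    prev_tag = "<s>"
    for word, tag in sentence:
        # Fall back to the tagger's own tag when no rule matches.
        tag = rules.get((prev_tag, word), tag)
        corrected.append((word, tag))
        prev_tag = tag  # later rules see the corrected tag
    return corrected
```

In this sketch a sentence is a list of `(word, tag)` pairs, and `apply_rules` would be called on each sentence produced by the CRF tagger. Conditioning later rules on the already corrected previous tag is a design choice of the sketch; conditioning on the tagger's original tag would be equally defensible.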
Keywords/Search Tags: Kazakh, multi-category words, POS tagging, ME, CRFs, association rules