Research On Chinese Parts Of Speech Tagging And POS Guessing Over Unknown Words

Posted on: 2016-02-21  Degree: Master  Type: Thesis
Country: China  Candidate: H M Liu  Full Text: PDF
GTID: 2308330464464466  Subject: Computer application technology
Abstract/Summary:
Chinese part-of-speech (POS) tagging assigns the correct part-of-speech tag to each word in a sentence according to its context. It is a foundation of natural language processing (NLP), since it provides necessary information for subsequent tasks such as information extraction, syntactic analysis, and machine translation.

Unknown words, also called new words, are words that are not included in the reference dictionary or do not appear in the training corpus. Unknown-word POS guessing is the task of assigning the correct part-of-speech tag to such a word.

Chinese POS tagging faces two difficulties: tagging multi-category words and guessing the POS of unknown words. Improving either one improves the overall result of Chinese POS tagging, so this thesis studies both problems in depth.

We first analyze and summarize the current research methods, then propose two models: a multi-category word POS tagging method based on rule acquisition, and an unknown-word POS guessing method based on a combination model. The contributions and innovations of this thesis are as follows:

1. POS tagging using line labeling models. We applied two classical models, the Maximum Entropy Model and the Hidden Markov Model, to POS tagging and implemented two baseline systems.

2. POS tagging using a joint model. To avoid the disadvantages of the line labeling model, we used a joint model that performs word segmentation and POS tagging concurrently. We then proposed a multi-category word POS tagging method based on rule acquisition and an unknown-word POS guessing method based on a combination model.

3. We used mutual information to acquire the rules for multi-category word POS tagging and assigned a priority to each rule in the rule bank. To predict the POS of an unknown word, we applied a combination of three models: a rule model, a trigram model, and a character position model.

4. Experiments on the People's Daily corpus of 2000 show that the joint model reaches an accuracy of 96.46%, much better than the two traditional baseline systems, which confirms the validity of the methods proposed in this thesis.
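
The mutual-information step in contribution 3 can be illustrated with a small sketch. The abstract does not give the formula or the rule format, so everything here is an assumption: the score is pointwise mutual information between a context feature and a tag, the feature strings such as `prev=v` are hypothetical, and the function name `pmi_rules` is made up for illustration. Candidate rules are sorted by score, which is one possible way to set rule priorities.

```python
import math
from collections import Counter


def pmi_rules(samples, min_count=1):
    """Rank (context feature, tag) pairs for one ambiguous word by PMI.

    samples: (feature, tag) pairs observed in a tagged corpus.
    Returns (feature, tag, pmi) tuples, highest PMI first.
    This is an illustrative assumption, not the thesis's actual procedure.
    """
    samples = list(samples)
    total = len(samples)
    pair_counts = Counter(samples)
    feat_counts = Counter(feat for feat, _ in samples)
    tag_counts = Counter(tag for _, tag in samples)

    rules = []
    for (feat, tag), n in pair_counts.items():
        if n < min_count:
            continue  # skip pairs that are too rare to trust
        # PMI = log2( P(feat, tag) / (P(feat) * P(tag)) )
        pmi = math.log2((n * total) / (feat_counts[feat] * tag_counts[tag]))
        rules.append((feat, tag, pmi))

    # Higher PMI means a stronger association, so it gets higher priority.
    rules.sort(key=lambda rule: rule[2], reverse=True)
    return rules


# Hypothetical observations for a single multi-category word.
observations = (
    [("prev=v", "n")] * 3
    + [("prev=d", "v")] * 3
    + [("prev=v", "v")] * 1
)
ranked = pmi_rules(observations)
```

A positive score means the feature and tag co-occur more often than chance would predict, which makes the pair a candidate rule. A negative score suggests the pair should be ranked low or discarded.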
Keywords/Search Tags: POS tagging, multi-category words, unknown words, Joint Model