HMM-based Chinese Part-of-Speech Tagging And Improvement

Posted on: 2012-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhu
Full Text: PDF
GTID: 2178330332490700
Subject: Computer application technology
Abstract/Summary:
Part-of-speech (POS) tagging is an important research topic in natural language processing. It has a wide range of applications and plays a foundational role in information processing. The quality of POS tagging directly affects the accuracy of every downstream task that builds on its results, such as syntactic analysis, speech recognition, text classification, text-to-speech, information retrieval, and machine translation. POS tagging faces several practical difficulties, including disambiguating multi-category words (words that can take more than one part of speech) and handling unknown words and proper nouns. Because of the characteristics of the Chinese language itself and the limits of current Chinese linguistics research, Chinese POS tagging is even more difficult and complex.

POS tagging methods fall broadly into two categories: rule-based methods and statistical methods. HMM-based POS tagging is a typical statistical method. Although the application of HMMs to POS tagging is very mature, improving tagging accuracy for multi-category words and unknown words remains a focus of research on HMM-based POS tagging. Based on the tagged Chinese corpus 《People's Daily (Jan. 1998)》, this thesis builds a second-order hidden Markov model (HMM2), improves the tagging of unknown words, and carries out Chinese POS tagging by training, testing, and evaluating the model. The main work is as follows:

(1) Because the choice of corpus strongly influences the results of POS tagging, the corpus is preprocessed before training and testing. The preprocessing removes the second dimension of the annotation and the markers of proper-noun tagging (the proper nouns themselves and their tags are kept) in order to improve the accuracy of the experiments.

(2) When a general HMM performs POS tagging, it relies only on the tag of the previous word to estimate the tag of the current word. From the standpoint of linguistic knowledge, this does not fully exploit the semantic information in the context. The thesis therefore proposes building a second-order HMM to make greater use of contextual semantic information and thereby increase the accuracy of the tagging results. In building the second-order HMM, the state transition probabilities obtained from the training data are smoothed; at the same time, based on what was observed in actual testing, the method of obtaining the observation probabilities is modified and unknown words are handled, to further ensure the accuracy of the experiments.

(3) In testing, the traditional Viterbi algorithm cannot be applied to the improved second-order HMM, so the Viterbi algorithm is improved and extended to meet the needs of the modified second-order HMM.

Open tests on ten thousand words, with training corpora annotated with 26 tags and with 39 tags, show that the improved second-order HMM proposed in this thesis performs better than both the general HMM and HMM2. Finally, the thesis discusses prospects for the development of POS tagging.
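As a rough illustration of items (2) and (3), the following Python sketch shows one common way to build a second-order HMM tagger: transition probabilities condition on the two previous tags and are smoothed by linear interpolation, and Viterbi decoding runs over tag pairs instead of single tags. This is not the thesis's implementation. The interpolation weights, the add-alpha emission smoothing used for unknown words, the `START` symbol, the function names, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical padding tag for sentence starts


def train(tagged_sents):
    """Collect tag n-gram, context, and (tag, word) emission counts."""
    c = {k: Counter() for k in ("uni", "bi", "tri", "ctx1", "ctx2", "emit")}
    for sent in tagged_sents:
        prev2, prev1 = START, START
        for word, tag in sent:
            c["uni"][tag] += 1
            c["bi"][(prev1, tag)] += 1
            c["tri"][(prev2, prev1, tag)] += 1
            c["ctx1"][prev1] += 1            # how often prev1 was a context
            c["ctx2"][(prev2, prev1)] += 1   # how often the pair was a context
            c["emit"][(tag, word)] += 1
            prev2, prev1 = prev1, tag
    return c


def trans_prob(c, t1, t2, t3, lambdas=(0.1, 0.3, 0.6)):
    """P(t3 | t1, t2) as an interpolation of unigram, bigram, and trigram
    estimates. The weights are illustrative assumptions, not tuned values."""
    l1, l2, l3 = lambdas
    p1 = c["uni"][t3] / sum(c["uni"].values())
    p2 = c["bi"][(t2, t3)] / c["ctx1"][t2] if c["ctx1"][t2] else 0.0
    ctx = c["ctx2"][(t1, t2)]
    p3 = c["tri"][(t1, t2, t3)] / ctx if ctx else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3


def emit_prob(c, tag, word, vocab_size, alpha=0.01):
    """Add-alpha smoothed P(word | tag); the extra slot (+1) reserves
    probability mass for unknown words (an assumption of this sketch)."""
    return (c["emit"][(tag, word)] + alpha) / (
        c["uni"][tag] + alpha * (vocab_size + 1))


def viterbi2(c, words):
    """Second-order Viterbi: each state is a pair (previous tag, current tag)."""
    if not words:
        return []
    tags = list(c["uni"])
    vocab_size = len({w for _, w in c["emit"]})
    best = {(START, START): 0.0}  # best log-probability per tag pair
    back = []                     # back[i][(t2, t3)] = tag two positions back
    for word in words:
        new_best, pointers = {}, {}
        for (t1, t2), score in best.items():
            for t3 in tags:
                s = (score
                     + math.log(trans_prob(c, t1, t2, t3))
                     + math.log(emit_prob(c, t3, word, vocab_size)))
                if s > new_best.get((t2, t3), -math.inf):
                    new_best[(t2, t3)] = s
                    pointers[(t2, t3)] = t1
        back.append(pointers)
        best = new_best
    # Follow back-pointers from the best final tag pair.
    pair = max(best, key=best.get)
    result = [None] * len(words)
    for i in range(len(words) - 1, -1, -1):
        result[i] = pair[1]
        pair = (back[i][pair], pair[0])
    return result


# Toy corpus (invented for illustration; not the People's Daily data).
toy = [
    [("我", "r"), ("爱", "v"), ("北京", "ns")],
    [("他", "r"), ("爱", "v"), ("上海", "ns")],
    [("我", "r"), ("去", "v"), ("上海", "ns")],
]
model = train(toy)
print(viterbi2(model, ["他", "去", "北京"]))  # -> ['r', 'v', 'ns']
```

Because each state is a tag pair, the search considers on the order of T^3 transitions per word for T tags, compared with T^2 for a first-order HMM. That is one reason a small tag set, such as the 26-tag or 39-tag annotations mentioned above, keeps second-order decoding practical.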
Keywords/Search Tags: part-of-speech tagging, hidden Markov model, second-order hidden Markov model, Viterbi algorithm