
Research on Part-of-Speech Tagging Based on the Hidden Markov Model

Posted on: 2014-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: X P Niu
GTID: 2268330401477731
Subject: Computer Science and Technology
Abstract/Summary:
The development of computer technology has revolutionized people’s lives, and people want to communicate with computers more efficiently. Natural language processing (NLP) technology emerged to meet this need. As an important basic research subject in NLP, part-of-speech (POS) tagging is both significant and widely applied. It usually acts as a preprocessing step in NLP systems, so its accuracy is vital for the follow-up work and even for the entire system. POS tagging must therefore provide highly accurate intermediate results for the subsequent NLP stages.

As the most important attribute of a word, part of speech is the main link between the word and the syntax. It provides a large amount of important information about the word and its context. It can also provide information about pronunciation, which is very useful for speech recognition models. Tagged text is the most basic training corpus in natural language processing; without such a tagged corpus, natural language processing is empty talk.

At present, research on part-of-speech tagging has largely matured. The main methods are rule-based tagging, statistical tagging, methods that combine rules with statistics, and transformation-based error-driven learning. Part-of-speech tagging is increasingly widely used, mainly in machine translation, text classification, automatic summarization, text proofreading, speech recognition, speech synthesis, corpus processing, and information retrieval.

This thesis works on the following three aspects to improve the accuracy of POS tagging. First, building on an implementation of the traditional HMM, I raise the order of the model and implement a second-order HMM, so that more useful context information can be incorporated than in the traditional model.
As a result, the accuracy of POS tagging is also improved. Second, the performance of existing smoothing algorithms still lacks in-depth research and analysis, so it is hard to choose the optimal one. I therefore explain the factors that affect the performance of existing smoothing algorithms and which algorithms work well in which situations, and then select the most suitable algorithm for the data, taking the model and the size of the training corpus into account. Finally, to incorporate unknown-word handling into the statistical model, I use word emission probabilities to estimate the probability of unknown words. More importantly, I introduce a way to deal with high-frequency unknown words in documents from a particular professional field.

I implemented the hidden Markov model in Java on the Eclipse platform, along with the improved methods described above, and tested them on both English and Chinese corpora. Experiments show that a second-order HMM combined with the data smoothing algorithm and the unknown-word processing achieves better part-of-speech tagging results. In addition, adding the high-frequency words to the training corpus helps to build a more accurate, standardized, and complete corpus.
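
To make the second-order idea concrete, the following is a minimal Java sketch of how tag transition probabilities conditioned on the two preceding tags might be estimated and smoothed. It is not the thesis's implementation: the abstract does not name the smoothing algorithm that was finally selected, so linear interpolation of unigram, bigram, and trigram estimates is used here purely as an assumed example. The class name `SecondOrderTransitions`, the `<s>` start marker, and the fixed weights are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: transition model of a second-order HMM tagger,
// smoothed by linear interpolation (an assumed choice, not the thesis's).
class SecondOrderTransitions {
    private static final String START = "<s>"; // hypothetical sentence-start marker

    private final Map<String, Integer> unigram = new HashMap<>();
    private final Map<String, Integer> bigram = new HashMap<>();
    private final Map<String, Integer> trigram = new HashMap<>();
    private final Map<String, Integer> history1 = new HashMap<>(); // counts of prev1
    private final Map<String, Integer> history2 = new HashMap<>(); // counts of (prev2, prev1)
    private int totalTags = 0;

    // Interpolation weights for unigram, bigram, trigram estimates (assumed fixed).
    private final double l1, l2, l3;

    SecondOrderTransitions(double l1, double l2, double l3) {
        if (Math.abs(l1 + l2 + l3 - 1.0) > 1e-9) {
            throw new IllegalArgumentException("weights must sum to 1");
        }
        this.l1 = l1;
        this.l2 = l2;
        this.l3 = l3;
    }

    // Collect n-gram counts from the tag sequence of one training sentence.
    void addSentence(List<String> tags) {
        String prev2 = START, prev1 = START;
        for (String t : tags) {
            unigram.merge(t, 1, Integer::sum);
            bigram.merge(prev1 + " " + t, 1, Integer::sum);
            trigram.merge(prev2 + " " + prev1 + " " + t, 1, Integer::sum);
            history1.merge(prev1, 1, Integer::sum);
            history2.merge(prev2 + " " + prev1, 1, Integer::sum);
            totalTags++;
            prev2 = prev1;
            prev1 = t;
        }
    }

    // Smoothed P(tag | prev2, prev1): a weighted mix of three relative frequencies.
    double probability(String prev2, String prev1, String tag) {
        double p1 = ratio(unigram.get(tag), totalTags);
        double p2 = ratio(bigram.get(prev1 + " " + tag), history1.get(prev1));
        double p3 = ratio(trigram.get(prev2 + " " + prev1 + " " + tag),
                          history2.get(prev2 + " " + prev1));
        return l1 * p1 + l2 * p2 + l3 * p3;
    }

    // Relative frequency; an unseen numerator or history contributes 0.
    private static double ratio(Integer num, Integer den) {
        if (num == null || den == null || den == 0) {
            return 0.0;
        }
        return (double) num / den;
    }

    public static void main(String[] args) {
        SecondOrderTransitions model = new SecondOrderTransitions(0.1, 0.3, 0.6);
        model.addSentence(Arrays.asList("DT", "NN", "VB"));
        model.addSentence(Arrays.asList("DT", "JJ", "NN"));
        // A tag trigram seen in training.
        System.out.println(model.probability(START, "DT", "NN"));
        // A tag trigram never seen in training still gets a non-zero probability.
        System.out.println(model.probability("VB", "VB", "NN"));
    }
}
```

In this sketch the trigram term carries the extra context that a first-order model cannot use, while the bigram and unigram terms keep the estimate above zero when a particular two-tag history was never observed in a small training corpus. In practice the weights would be tuned on held-out data rather than fixed by hand.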
Keywords/Search Tags: part-of-speech tagging, hidden Markov model, smoothing algorithm, high-frequency unknown words