
Research On Hidden Markov Model For Chinese Natural Language Processing

Posted on: 2004-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: B Chen
Full Text: PDF
GTID: 2168360095956762
Subject: Computer software and theory
Abstract/Summary:
Compared with more formal domains such as programming languages, Natural Language Processing (NLP) faces greater difficulty in acquiring and applying knowledge. In early NLP research, nearly all of the required knowledge, such as translation lexicons and various grammars, was compiled by linguists. But natural language is a product of social development, and its regularities cannot simply be collected by experts; manually gathered knowledge tends to vary in form, to be arbitrary, and to be costly to produce. The growth of the Internet and the abundance of digital text now make it feasible to learn NLP knowledge with statistical methods, which require no prior knowledge, adapt more readily, and cost less. Statistical approaches have developed rapidly in recent years and have achieved notable success in applications such as speech recognition and OCR.

This thesis studies Chinese statistical language modeling based on the Hidden Markov trigram model. The main topics are monolingual corpus collection, model selection, training, smoothing, and compression. An object-oriented toolkit for Chinese statistical language modeling is presented, and the original trigram model is extended to capture longer-distance dependencies. The contributions are as follows:

First, considering the characteristics of Chinese, the thesis re-examines techniques for corpus collection, model training, smoothing, and compression that were originally developed for Western-language modeling, analyzes their properties and their effects on a Chinese trigram model, and searches experimentally for the best combination of techniques.

Second, after examining long-distance dependency phenomena in modern Chinese, the thesis proposes an improved model, LP-Trigram, which adds a class of long-distance dependencies to the trigram model. To accommodate the change in model structure, the Viterbi algorithm used for search in the ordinary HMM trigram is also extended. The new model incorporates long-distance dependencies and resolves some ambiguities while keeping the size and speed of the original trigram largely unchanged.

Third, the performance of LP-Trigram is evaluated on a Pinyin-to-Hanzi conversion system. Experiments show that LP-Trigram corrects some of the conversion errors made by the traditional trigram model, demonstrating that long-distance dependencies can be properly expressed within the HMM trigram framework.

Finally, the thesis summarizes its work and points out directions for future research.
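The abstract describes the LP-Trigram model only at a high level. As background, the following is a minimal sketch of the baseline it builds on: Viterbi search over a trigram language model for Pinyin-to-Hanzi conversion. The lexicon, probabilities, and smoothing floor below are toy placeholders assumed for illustration, not the thesis's actual data, toolkit, or LP-Trigram implementation.

import math

# Candidate Hanzi for each Pinyin syllable (toy lexicon, assumed for illustration).
lexicon = {
    "zhong": ["中", "种", "重"],
    "guo":   ["国", "过", "果"],
}

def trigram_logp(w1, w2, w3):
    """Smoothed trigram log-probability log P(w3 | w1, w2).
    A real system would back off to bigram/unigram estimates trained on a corpus;
    here one path gets a bonus and everything else a flat smoothing floor."""
    if (w1, w2, w3) == ("<s>", "中", "国"):
        return math.log(0.5)
    return math.log(1e-4)

def viterbi_convert(pinyin_syllables):
    """Return the most probable Hanzi sequence under the trigram model.
    States are pairs (w_{i-1}, w_i) so the full trigram context is available."""
    # Each beam maps state (prev, cur) -> (log-prob, backpointer state).
    beams = [{("<s>", "<s>"): (0.0, None)}]
    for syl in pinyin_syllables:
        next_beam = {}
        for (w1, w2), (score, _) in beams[-1].items():
            for w3 in lexicon.get(syl, []):
                cand = score + trigram_logp(w1, w2, w3)
                state = (w2, w3)
                if state not in next_beam or cand > next_beam[state][0]:
                    next_beam[state] = (cand, (w1, w2))
        beams.append(next_beam)
    # Trace back from the best final state.
    state = max(beams[-1], key=lambda s: beams[-1][s][0])
    output = []
    for beam in reversed(beams[1:]):
        output.append(state[1])
        state = beam[state][1]
    return "".join(reversed(output))

print(viterbi_convert(["zhong", "guo"]))  # with these toy scores: 中国

The thesis's LP-Trigram would additionally condition on selected long-distance predecessors and extend this search accordingly; that extension is not reproduced here.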
Keywords/Search Tags:Statistical Natural Language Processing, Hidden Markov Model, Long Dependency, LP-Trigram