
Research And Application Of Statistical Language Model

Posted on: 2011-12-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Wen
Full Text: PDF
GTID: 1118360308462210
Subject: Signal and Information Processing
Abstract/Summary:
Statistical language models have contributed greatly to natural language processing. They have been applied successfully in many areas of computational linguistics, such as speech recognition, optical character recognition, machine translation, Chinese Pinyin-to-Character conversion, and information retrieval. However, existing language models still have flaws, and two serious deficiencies limit their applicability. First, they do not take long-distance dependencies between words into account. Second, they require the test corpus to be highly consistent with the training corpus.

This dissertation addresses these shortcomings of existing statistical language models and studies Chinese language modeling from several angles. The main contents are as follows.

First, because language models are closely tied to corpora, the way a corpus is used directly affects model performance. As preprocessing before modeling, we built a corpus search tool that can retrieve specific paragraphs, sentences, sub-sentences, and word strings. It also supports searching with complex logical expressions and user-defined language patterns.

Second, to improve the modeling of long-distance dependencies, we study long-distance dependency language models along two lines: skipping units and enlarging units. For enlarging units, we propose a Chinese Frequent String (CFS) extraction algorithm based on the string segmentation degree and construct a CFS-based n-gram model. Because the granularity and average length of a CFS are larger than those of a word or a character, this model captures long-distance dependencies better than a word-based language model. For skipping units, unlike previous methods that skip on the basis of function words and content words, we apply deeper semantic information to the skipping model. We adopt the semantic grid system proposed by Xingguang Lin and present a semantic frame based language model (SFLM). Experiments show that this kind of model can capture long-distance dependencies and reduce model perplexity.

Third, because language model performance depends heavily on the consistency of the training and test corpora, we present an empirical study of two training methods for model adaptation. On the generative training side, we target the data sparseness problem and improve the K-N smoothing algorithm. Experiments show that our K-N smoothing algorithm performs better when the training and test corpora differ. On the discriminative training side, we introduce the N-best algorithm into the Minimum Sample Risk method and study the model's ability to adapt to different kinds of corpora.

Fourth, because Chinese Pinyin-to-Character conversion is an important application of Chinese language models, we also study Chinese input methods that use a language model. Segmenting the Pinyin string is an important part of the conversion. We define the ambiguities in Pinyin string segmentation, classify them into overlap and combinational ambiguities, and propose a disambiguation algorithm for each type. Experiments show that the proposed algorithms perform well. We then apply a trigram language model to sentence-based Pinyin-to-Character conversion, using the A* heuristic search algorithm to find the optimal path. Experiments show that our model performs well on Pinyin-to-Character conversion.
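To make the two kinds of Pinyin segmentation ambiguity concrete, here is a minimal Python sketch. It is not the dissertation's algorithm: the abstract does not give the exact definitions, so the sketch assumes the definitions commonly used for Chinese word segmentation. Under that assumption, a combinational ambiguity arises when a valid syllable can itself be split into valid syllables ("xian" versus "xi" + "an"), and an overlap ambiguity arises when two readings place a boundary at conflicting positions ("fang" + "an" versus "fan" + "gan"). The syllable inventory and the function names are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# Tiny illustrative syllable inventory (an assumption, not the real Mandarin set).
SYLLABLES = frozenset({"xi", "an", "xian", "fang", "fan", "gan"})
MAX_LEN = max(map(len, SYLLABLES))


def segmentations(pinyin):
    """Return every way to split `pinyin` into syllables from SYLLABLES."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the string yields one (empty) completion.
        if start == len(pinyin):
            return ((),)
        results = []
        for end in range(start + 1, min(len(pinyin), start + MAX_LEN) + 1):
            piece = pinyin[start:end]
            if piece in SYLLABLES:
                results.extend((piece,) + rest for rest in split_from(end))
        return tuple(results)

    return [list(seg) for seg in split_from(0)]


def boundaries(seg):
    """Return the set of internal cut positions of one segmentation."""
    cuts, pos = set(), 0
    for syllable in seg[:-1]:
        pos += len(syllable)
        cuts.add(pos)
    return cuts


def classify(pinyin):
    """Label a Pinyin string under the assumed ambiguity definitions."""
    segs = segmentations(pinyin)
    if not segs:
        return "invalid"
    if len(segs) == 1:
        return "unambiguous"
    cut_sets = [boundaries(s) for s in segs]
    for i, a in enumerate(cut_sets):
        for b in cut_sets[i + 1:]:
            # Neither reading refines the other: the boundaries conflict.
            if not (a <= b or b <= a):
                return "overlap"
    # Every pair of readings differs only by splitting a syllable further.
    return "combinational"


if __name__ == "__main__":
    for s in ("xian", "fangan", "gan"):
        print(s, segmentations(s), classify(s))
```

With this toy inventory, "xian" is labelled combinational and "fangan" is labelled overlap. A full system would still need a scoring step to choose among the candidate segmentations, for example a language model score.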
Keywords/Search Tags: statistical language model, long distance dependency, cross-corpus adaptation, Chinese frequent string, K-N smoothing method, ambiguities in segmentation of Pinyin string, Chinese Pinyin-to-Character conversion