
The Study Of Non-stationary Language Modeling Techniques And Its Practices

Posted on: 2008-06-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J H Xiao
GTID: 1118360245997432
Subject: Computer application technology
Abstract/Summary:
A language model is a mathematical description of natural language, usually presented as a formal system that explains and exploits the regularities of language. The study of language models is fundamental to natural language processing. Its results apply directly to the Chinese Pinyin-to-Character conversion task and benefit many other natural language processing tasks, including speech recognition, handwriting recognition, optical character recognition, machine translation, information retrieval, and multi-level corpus processing.

The quantity of digital text on the internet is increasing rapidly. Stochastic techniques have become the main approach to language modeling because of their high accuracy and robustness, and the stochastic language model has become the most prevalent kind of language model. However, it treats natural language purely statistically, as a stochastic chain, and ignores the characteristics of language. Incorporating linguistic knowledge into the stochastic language model is therefore one of the open challenges. Two problems stand in the way of combining linguistic knowledge with current stochastic language models directly: 1. it is difficult to acquire precise linguistic knowledge automatically; 2. it is hard to integrate linguistic knowledge into the current language modeling framework.

To address these problems, this paper formally represents the positional information of language elements and exploits its regularities in language modeling. Concretely, a language element plays different roles in different parts of a sentence because of its syntactic and semantic properties. The probability of a language element is therefore related to its position.
To exploit this positional information, the traditional stationary hypothesis on language elements is relaxed and a non-stationary hypothesis is adopted: the occurrence of the current language element is determined in part by its position in the sequence of language elements. Based on this hypothesis, the paper studies the theory, techniques, methods, and related issues of non-stationary language modeling. Finally, these techniques are applied to the Chinese Pinyin-to-Character conversion task to improve its performance. The paper consists of four main parts.

First, the paper prepares the resources and proposes a Chinese lexicon construction algorithm for language modeling. It combines Chinese lexicon construction with language modeling in a unified iterative framework, in which the performance of the language model is improved by optimizing the lexicon. Within this framework, a multi-feature lexicon construction algorithm is proposed that exploits both statistical and lexical features. Finally, two heuristic methods are proposed to make the system adapt itself to the domain of the training corpus.

Second, the paper studies the theory and techniques of non-stationary language modeling. It first provides a formal representation of the positional information of language elements, from which the regularities of their non-stationary behavior are induced. These regularities are then incorporated into the language modeling process. Two non-stationary language models are proposed: the non-stationary N-gram model and the non-stationary Maximum Entropy Markov model.
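
As a rough illustration of the non-stationary hypothesis, the sketch below lets a bigram probability depend on where in the sentence the word occurs. It is not the model defined in the dissertation, whose exact form the abstract does not give: the class name `PositionalBigram`, the coarse position buckets (`num_buckets`), and the interpolation weight `lam` are all assumptions made for this example.

```python
from collections import Counter


class PositionalBigram:
    """Toy bigram model whose estimates depend on relative sentence position."""

    def __init__(self, num_buckets=2, lam=0.5):
        self.num_buckets = num_buckets  # assumption: coarse bins over relative position
        self.lam = lam                  # assumption: weight of the position-dependent estimate
        self.pos_pair = Counter()       # counts of (bucket, prev, word)
        self.pos_ctx = Counter()        # counts of (bucket, prev)
        self.pair = Counter()           # counts of (prev, word)
        self.ctx = Counter()            # counts of prev

    def bucket(self, i, length):
        # Map word index i in a sentence of the given length to a position bin.
        return min(self.num_buckets - 1, i * self.num_buckets // length)

    def train(self, sentences):
        for sent in sentences:
            tokens = ["<s>"] + list(sent)
            n = len(sent)
            for i in range(n):
                prev, word = tokens[i], tokens[i + 1]
                b = self.bucket(i, n)
                self.pos_pair[(b, prev, word)] += 1
                self.pos_ctx[(b, prev)] += 1
                self.pair[(prev, word)] += 1
                self.ctx[prev] += 1

    def prob(self, word, prev, i, length):
        # Interpolate the position-dependent estimate with the ordinary
        # (stationary) bigram estimate; unseen contexts contribute 0.
        b = self.bucket(i, length)
        positional = 0.0
        if self.pos_ctx[(b, prev)]:
            positional = self.pos_pair[(b, prev, word)] / self.pos_ctx[(b, prev)]
        stationary = 0.0
        if self.ctx[prev]:
            stationary = self.pair[(prev, word)] / self.ctx[prev]
        return self.lam * positional + (1 - self.lam) * stationary


model = PositionalBigram(num_buckets=2, lam=0.5)
model.train([["a", "b", "a", "c"]])
early = model.prob("b", "a", 1, 4)  # "b" after "a" in the first half of the sentence
late = model.prob("b", "a", 3, 4)   # the same bigram in the second half
```

In this toy corpus "a" is followed by "b" early in the sentence and by "c" late in it, so `early` is larger than `late`, whereas a stationary bigram would assign the same probability at both positions.
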
Several related issues are discussed, including model construction, the training algorithm, the smoothing technique, and model complexity. Finally, the models are evaluated on the Pinyin-to-Character conversion task and the POS-tagging task, respectively.

Third, the paper proposes a semantic-based smoothing technique to address the data sparseness problem of language models. It acquires semantic information from HowNet and Tongyici Cilin and combines it with traditional smoothing techniques. Iterative algorithms are designed to optimize the parameters automatically.

Fourth, the paper applies language modeling techniques to Chinese keyboard input methods. It first proposes the Key-to-Pinyin conversion task for the digit keypads of mobile devices; two kinds of solutions are provided and verified experimentally. It then improves the performance of the current Pinyin-to-Character conversion system by exploiting the pinyin constraints entered by users. A class-based Maximum Entropy Markov model is proposed to describe both the constraints from pinyin and those between characters. The experimental results show that the pinyin constraints effectively improve the performance of Pinyin-to-Character conversion.
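
To make the Key-to-Pinyin task concrete, the following sketch enumerates the pinyin syllable sequences that are consistent with a digit string typed on a standard telephone keypad. The abstract does not describe the two solutions proposed in the dissertation, so this is only an illustration of the ambiguity involved: the four-syllable inventory `SYLLABLES` and the helpers `encode`, `build_index`, and `key_to_pinyin` are assumptions made for this example.

```python
# Standard letter assignment on a telephone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

# Assumption: a toy syllable inventory; a real system would load the full pinyin table.
SYLLABLES = ["ni", "mi", "hao", "gao"]


def encode(syllable):
    """Return the digit string produced by typing the syllable on the keypad."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in syllable)


def build_index(syllables):
    """Group syllables by their digit code, since several syllables can share one."""
    index = {}
    for syllable in syllables:
        index.setdefault(encode(syllable), []).append(syllable)
    return index


def key_to_pinyin(digits, index):
    """Return every syllable sequence whose digit codes concatenate to `digits`."""
    results = []

    def extend(start, prefix):
        if start == len(digits):
            results.append(prefix)
            return
        for end in range(start + 1, len(digits) + 1):
            for syllable in index.get(digits[start:end], []):
                extend(end, prefix + [syllable])

    extend(0, [])
    return results


index = build_index(SYLLABLES)
candidates = key_to_pinyin("64426", index)
```

Even with this tiny inventory the digit string `64426` has four readings, including `ni hao` and `mi gao`, which is why a language model is needed to rank the candidates before Pinyin-to-Character conversion.
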
Keywords/Search Tags: Non-stationary property, Language model, Smoothing technique, Pinyin-to-Character conversion, Key-to-Pinyin conversion