The Study Of Non-stationary Language Modeling Techniques And Its Practices

Posted on:2008-06-03

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J H Xiao

Full Text:PDF

GTID:1118360245997432

Subject:Computer application technology

Abstract/Summary:

A language model is a mathematical description of natural language, usually presented as a formal system that explains and exploits the regularities of language. The study of language models is fundamental to natural language processing. Its results apply directly to the Chinese Pinyin-to-Character conversion task and benefit many other natural language processing tasks, including speech recognition, handwriting recognition, optical character recognition, machine translation, information retrieval, and multi-level corpus processing.

The quantity of digital text on the Internet is increasing rapidly. Stochastic techniques have become the main approach to language modeling because of their high accuracy and strong robustness, and the stochastic language model has become the most prevalent kind of language model. However, it treats natural language purely as a stochastic chain from a statistical point of view and ignores the characteristics of language. Incorporating linguistic knowledge into a stochastic language model is therefore one of the open challenges. Two problems arise when linguistic knowledge is combined directly with current stochastic language models: (1) it is difficult to acquire precise linguistic knowledge automatically; (2) it is hard to integrate linguistic knowledge into the current language modeling framework.

To address these problems, this paper formally represents the positional information of language elements and exploits its regularities in language modeling. Specifically, a language element plays different roles in different parts of a sentence because of its syntactic and semantic properties. The probability of a language element is therefore related to its positional information.
To exploit this positional information, the traditional stationary hypothesis about language elements is relaxed and a non-stationary hypothesis is made: the occurrence of the current language element is determined partly by its position in the sequence of language elements. Based on this hypothesis, the paper studies the theory, techniques, methods, and related issues of non-stationary language modeling. Finally, these techniques are applied to the Chinese Pinyin-to-Character conversion task to improve its performance. The paper consists of four main parts.

First, the paper prepares the resources and proposes a Chinese lexicon construction algorithm for language modeling. It combines Chinese lexicon construction with language modeling in a unified iterative framework, in which the performance of the current language model is improved by optimizing the lexicon. Within this framework, a multi-feature lexicon construction algorithm is proposed that exploits both statistical and lexical features. Finally, two heuristic methods are proposed to make the system adapt itself to the domain of the training corpus.

Second, the paper studies the theory and techniques of non-stationary language modeling. It first provides a formal representation of the positional information of language elements, from which the regularities of their non-stationary property are induced. These regularities are then incorporated into the language modeling process. Two non-stationary language models are proposed: the non-stationary N-gram model and the non-stationary Maximum Entropy Markov model.
Several related issues, including model construction, the training algorithm, the smoothing technique, and model complexity, are discussed. Finally, these models are evaluated on the Pinyin-to-Character conversion task and the POS-tagging task, respectively.

Third, the paper proposes a semantic-based smoothing technique to address the data sparseness problem of language models. It acquires semantic information from HowNet and Tongyici Cilin and combines it with traditional smoothing techniques. Iterative algorithms are designed to optimize the parameters automatically.

Fourth, the paper applies language modeling techniques to Chinese keyboard input methods. It first proposes the Key-to-Pinyin conversion task for the digit keypads of mobile devices; two kinds of solutions are provided and verified experimentally. It then improves the performance of the current Pinyin-to-Character conversion system by exploiting the pinyin constraints entered by users. A class-based Maximum Entropy Markov model is proposed to describe both the constraints from pinyin and those between characters. The experimental results show that the pinyin constraints effectively improve the performance of Pinyin-to-Character conversion.
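
The non-stationary hypothesis in the abstract (the probability of a language element depends partly on its position in the sequence) can be illustrated with a small sketch. This is not the dissertation's model: the class name `NonStationaryBigram`, the coarse relative-position bucketing in `position_bucket`, and the add-alpha smoothing are all assumptions chosen only to make the idea concrete.

```python
from collections import Counter


def position_bucket(index, length, num_buckets=3):
    """Map a 0-based token index to a coarse relative-position bucket.

    Hypothetical scheme: the sentence is split into `num_buckets` equal
    spans. `length` must be at least 1.
    """
    return min(index * num_buckets // length, num_buckets - 1)


class NonStationaryBigram:
    """Bigram model whose counts are kept separately per position bucket."""

    def __init__(self, num_buckets=3, alpha=1.0):
        self.num_buckets = num_buckets
        self.alpha = alpha               # add-alpha smoothing constant (assumption)
        self.pair_counts = Counter()     # (bucket, prev, cur) -> count
        self.context_counts = Counter()  # (bucket, prev) -> count
        self.vocab = set()

    def train(self, sentences):
        for sent in sentences:
            tokens = ["<s>"] + list(sent)
            n = len(sent)
            for i in range(1, len(tokens)):
                # i - 1 is the position of the current token in the sentence
                b = position_bucket(i - 1, n, self.num_buckets)
                prev, cur = tokens[i - 1], tokens[i]
                self.pair_counts[(b, prev, cur)] += 1
                self.context_counts[(b, prev)] += 1
                self.vocab.add(cur)

    def prob(self, cur, prev, index, length):
        """Estimate P(cur | prev, position bucket) with add-alpha smoothing."""
        b = position_bucket(index, length, self.num_buckets)
        v = len(self.vocab)
        numerator = self.pair_counts[(b, prev, cur)] + self.alpha
        denominator = self.context_counts[(b, prev)] + self.alpha * v
        return numerator / denominator


# Toy corpus: the bigram (b, a) occurs in the middle and at the end of sentences.
corpus = [["a", "b", "a"], ["a", "b", "a"], ["b", "a", "b"]]
model = NonStationaryBigram()
model.train(corpus)
print(model.prob("a", "b", index=1, length=3))  # middle of the sentence
print(model.prob("a", "b", index=2, length=3))  # end of the sentence
```

In this toy setting the same bigram receives different estimates at different positions, which a stationary bigram model cannot express. Splitting counts by position also makes the data sparser, which is one reason smoothing matters for models of this kind.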

Keywords/Search Tags:

Non-stationary property, Language model, Smoothing technique, Pinyin-to-Character conversion, Key-to-Pinyin conversion

Related items

1	The Key Technology Research And Implementation Of The Pinyin-to-character Conversion System
2	Research And Application Of Statistical Language Model
3	Design And Implementation Of Intelligent Pinyin Input Method Based On Android Platform
4	Research Of Pinyin Input Method For Non-Chinese Native Chinese Learners
5	Research And Design Of Pinyin Input Method For Chinese Teaching In Primary And Secondary Schools
6	Auxiliary English Writing Method Based Chinese Pinyin
7	Pinyin Conversion Based On Neural Networks
8	A Study On Related Problems Of Chinese Input Method
9	The Continuous Chinese Pinyin Input System Based On Slide Track
10	Research And Implementation Of Web-based Learning System Pinyin