The Study On Basic Elements To Build Statistical Language Model Of Uyghur

Posted on: 2014-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: L Tang
Full Text: PDF
GTID: 2268330401464366
Subject: Software engineering
Abstract/Summary:
As a mathematical model that describes the inherent regularities of natural language, the language model occupies an important position in natural language processing, so building a reliable language model is essential for any natural language processing system. As a basic component of Uyghur natural language processing, the Uyghur language model is widely used in speech recognition, machine translation, information retrieval, and related fields. However, research on Uyghur language models is still at an early stage, so further study is of great significance for the information-based development of the Xinjiang region.

This dissertation aims to build Uyghur language models from different modeling units, and to identify the smoothing method and the units best suited to Uyghur language modeling. The contents of this dissertation are as follows:

To address data sparseness, several smoothing algorithms were studied: additive smoothing, Good-Turing smoothing, Witten-Bell smoothing, Katz smoothing, absolute discounting, and Kneser-Ney smoothing. The experimental results show that absolute discounting achieved the lowest perplexity.

The experimental data were collected from Uyghur spoken telephone dialogue corpora, and from text corpora drawn from a bilingual teaching system and some daily expressions. After preprocessing, these data were converted into Uyghur text corpora. Two word segmentation methods were adopted: a dictionary-based Uyghur word segmentation method and an unsupervised segmentation method.

Based on Uyghur segmentation, the traditional N-gram statistical language model was improved. Uyghur words can be divided into different units; using these units, three kinds of Uyghur language models were built, and an N-gram language model based on morpheme classes was proposed.

In this thesis, a series of experiments was conducted using the SRILM 1.5.12 and MITLM 0.4 toolkits. The results showed that the perplexity of the morpheme-based Uyghur language model was far lower than that of the word-based model: the former was reduced to about 2/3 of the latter.
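As an illustration of the two quantities the abstract relies on, the following is a minimal Python sketch of a bigram model with interpolated absolute discounting and of perplexity computed on held-out sentences. It is not the thesis implementation, which used the SRILM and MITLM toolkits; the function names, the discount value 0.75, the add-one unigram fallback, and the toy tokens are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

BOS, EOS, UNK = "<s>", "</s>", "<unk>"  # hypothetical marker tokens


def train_bigram(sentences, discount=0.75):
    """Return (prob, vocab) for an interpolated absolute-discounting bigram model.

    `discount` is an assumed constant, not a value taken from the thesis.
    """
    unigrams = Counter()            # counts of predicted tokens
    bigrams = Counter()             # counts of (previous, current) pairs
    history = Counter()             # how often each token occurs as a history
    followers = defaultdict(set)    # distinct tokens seen after each history

    for sent in sentences:
        tokens = [BOS] + list(sent) + [EOS]
        unigrams.update(tokens[1:])  # BOS is never predicted
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            history[prev] += 1
            followers[prev].add(cur)

    vocab = set(unigrams) | {UNK}
    total = sum(unigrams.values())

    def prob(prev, cur):
        if cur not in vocab:
            cur = UNK
        # Lower-order distribution: add-one smoothed unigram (an assumption).
        p_uni = (unigrams[cur] + 1) / (total + len(vocab))
        h = history[prev]
        if h == 0:
            return p_uni  # unseen history: fall back to the unigram estimate
        # Subtract a fixed discount from every seen bigram count ...
        discounted = max(bigrams[(prev, cur)] - discount, 0.0) / h
        # ... and hand the freed probability mass to the unigram distribution.
        backoff_weight = discount * len(followers[prev]) / h
        return discounted + backoff_weight * p_uni

    return prob, vocab


def perplexity(prob, sentences):
    """Perplexity = exp of the negative mean log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = [BOS] + list(sent) + [EOS]
        for prev, cur in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(prev, cur))
            n += 1
    return math.exp(-log_sum / n)


# Toy placeholder tokens; real input would be word or morpheme sequences.
train = [["a", "b", "c"], ["a", "b", "d"], ["b", "c"]]
prob, vocab = train_bigram(train)
print(perplexity(prob, [["a", "b", "c"]]))
```

The same two functions accept either word tokens or morpheme tokens; only the segmentation applied to the sentences beforehand changes.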
Keywords/Search Tags: Uyghur, Language Model, Perplexity, Model Units, Text corpus, Morpheme