The Study On Basic Elements To Build Statistical Language Model Of Uyghur

Posted on:2014-08-17

Degree:Master

Type:Thesis

Country:China

Candidate:L Tang

GTID:2268330401464366

Subject:Software engineering

Abstract/Summary:

As a mathematical model that describes the inherent regularities of natural language, the language model occupies an important position in natural language processing, so building a reliable language model is essential to any natural language processing system. As a basic component of Uyghur natural language processing, the Uyghur language model is widely used in speech recognition, machine translation, information retrieval, and other fields. However, research on Uyghur language models is still at an early stage, so further study of this model is of great significance for the information-based development of the Xinjiang region.

This dissertation aims to build Uyghur language models from different modeling units and to identify the smoothing method and unit that work best for Uyghur. Its contents are as follows:

To address data sparseness, several smoothing algorithms were studied: additive smoothing, Good-Turing smoothing, Witten-Bell smoothing, Katz smoothing, absolute discounting, and Kneser-Ney smoothing. The experimental results show that absolute discounting gave the best perplexity.

The experimental data were collected from Uyghur telephone spoken-dialogue corpora, together with text corpora from a bilingual teaching system and some everyday expressions. After preprocessing, these data were converted into Uyghur text corpora. Two word segmentation methods were adopted: a dictionary-based Uyghur word segmentation method and an unsupervised segmentation method.

Based on Uyghur segmentation, the traditional N-gram statistical language model was improved. Uyghur words can be divided into different units; using these units, three kinds of Uyghur language model were built, and an N-gram language model based on morpheme classes was proposed.

A series of experiments was conducted using the SRILM 1.5.12 and MITLM 0.4 toolkits. The results show that the perplexity of the morpheme-based Uyghur language model was far lower than that of the word-based model, at about 2/3 of the latter's value.
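
The perplexity comparisons described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation (the experiments used SRILM and MITLM); it assumes a bigram model with interpolated absolute discounting, a fixed discount of 0.75, and an add-one smoothed unigram distribution as the lower-order model. The function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def train_bigram_absolute_discount(sentences, discount=0.75):
    """Build prob(prev, word) for a bigram model with interpolated
    absolute discounting. All names here are illustrative assumptions."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[1:])              # tokens that get predicted
        bigrams.update(zip(tokens, tokens[1:]))  # (previous, current) pairs

    total = sum(unigrams.values())
    vocab_size = len(unigrams) + 1  # one extra bucket for unseen tokens

    context_count = Counter()  # how often each context was seen
    followers = Counter()      # distinct tokens seen after each context
    for (prev, _word), count in bigrams.items():
        context_count[prev] += count
        followers[prev] += 1

    def unigram_prob(word):
        # Add-one smoothed unigram, used as the lower-order distribution.
        return (unigrams[word] + 1) / (total + vocab_size)

    def prob(prev, word):
        c_prev = context_count[prev]
        if c_prev == 0:
            # Unseen context: fall back to the unigram distribution.
            return unigram_prob(word)
        # Subtract a fixed discount from every seen bigram count ...
        discounted = max(bigrams[(prev, word)] - discount, 0.0) / c_prev
        # ... and give the freed mass to the lower-order model.
        backoff_weight = discount * followers[prev] / c_prev
        return discounted + backoff_weight * unigram_prob(word)

    return prob


def perplexity(prob, sentences):
    """Perplexity = exp of the average negative log-probability per token."""
    log_sum = 0.0
    n_tokens = 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(prev, word))
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)


if __name__ == "__main__":
    # Placeholder tokens; a real experiment would use segmented Uyghur text.
    train = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
    model = train_bigram_absolute_discount(train)
    print(perplexity(model, [["a", "b", "c"]]))
```

To compare modeling units in this way, the same corpus would be tokenized once into words and once into morphemes, with a model trained on each. Perplexities measured over different token inventories are not directly comparable unless they are normalized to a common unit, which this sketch does not do.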

Keywords/Search Tags:

Uyghur, Language Model, Perplexity, Model Units, Text corpus, Morpheme

Related items

1	Research Of Uyghur N-gram Model And Smoothing Algorithm
2	Research On Uyghur Text Recognition In The Scene Image
3	Research And Application Of Uyghur-chinese Machine Translation Model Based On Deep Learning
4	Research On The Construction Of Uyghur Text Corpus For Sign Language Information Processing
5	Research On Entity Recognition Of Person Names In Uyghur Text Corpus
6	Research On The Technologies Of HTK Based Uyghur Continuous Phoneme Recognition
7	Research And System Implementation Of Uyghur Text Classification Based On N-gram
8	Research On Uyghur Full-text Information Retrieval Based On N-gram Characters Model
9	Based On The Stem Of The Uyghur Language Text Cluster Research And Implementation
10	An Automatic Chinese Text Categorization System Based On Statistical Language Model