Research On Uyghur Full-text Information Retrieval Based On N-gram Characters Model

Posted on:2017-11-24

Degree:Master

Type:Thesis

Country:China

Candidate:L R Xu

Full Text:PDF

GTID:2348330503984340

Subject:Computer application technology

Abstract/Summary:

In conventional Uyghur information retrieval, the flexibility and diversity of Uyghur word formation and word shape force the system to segment each word into its stem and index the stem as the index term. However, because of the shortcomings and limitations of the segmentation tool itself, some stems are not recognized or are identified incorrectly, which indirectly reduces the retrieval effectiveness of the system. To address this problem, and in keeping with the morphological features of the Uyghur language, this thesis builds the index from character n-grams of an appropriate length and then establishes an n-gram language model over that index. While building the language model, an appropriate smoothing algorithm is selected to optimize the document language model and the corpus language model, compensating for the data sparseness of a single-document model. To make the retrieval results more accurate, several models are combined in the scoring process. Finally, a full-text retrieval system based on the Uyghur character n-gram model is implemented with the open-source Lucene search toolkit, and a Uyghur news corpus collected by a Python crawler is used for retrieval tests. The results show that a mixed unigram model over character n-grams of length 3 and length 4, with Dirichlet smoothing and its parameter set to 2000, gives the best retrieval results, and that this method performs better than the traditional retrieval approach.
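
The scoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, which is built on Lucene. The function names (`char_ngrams`, `dirichlet_log_score`, `rank`), the Latin-script placeholder strings, and the choice to combine the length-3 and length-4 models by summing their log scores are all assumptions made for illustration; the abstract does not say how the two models are mixed. The only values taken from the abstract are the n-gram lengths 3 and 4 and the Dirichlet parameter 2000.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return overlapping character n-grams of each whitespace-delimited token."""
    grams = []
    for token in text.split():
        if len(token) < n:
            grams.append(token)  # assumption: keep short tokens whole
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


def dirichlet_log_score(query_grams, doc_counts, doc_len, coll_counts, coll_len, mu=2000.0):
    """Query log-likelihood under a Dirichlet-smoothed unigram model of n-grams.

    p(g | d) = (count(g, d) + mu * p(g | collection)) / (|d| + mu)
    """
    if coll_len == 0:
        return 0.0
    score = 0.0
    for g in query_grams:
        p_coll = coll_counts.get(g, 0) / coll_len
        if p_coll == 0.0:
            continue  # assumption: ignore n-grams unseen in the whole collection
        p_doc = (doc_counts.get(g, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p_doc)
    return score


def rank(query, docs, sizes=(3, 4), mu=2000.0):
    """Rank documents by the summed log scores of one model per n-gram length.

    Summing the per-length scores is an illustrative assumption, not a
    combination rule stated in the abstract.
    """
    totals = [0.0] * len(docs)
    for n in sizes:
        doc_counts = [Counter(char_ngrams(d, n)) for d in docs]
        doc_lens = [sum(c.values()) for c in doc_counts]
        coll_counts = Counter()
        for c in doc_counts:
            coll_counts.update(c)
        coll_len = sum(coll_counts.values())
        query_grams = char_ngrams(query, n)
        for i, counts in enumerate(doc_counts):
            totals[i] += dirichlet_log_score(
                query_grams, counts, doc_lens[i], coll_counts, coll_len, mu
            )
    # Indices of docs, best match first.
    return sorted(range(len(docs)), key=lambda i: totals[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical Latin-script placeholder documents, not data from the thesis.
    docs = ["mektep bala", "kitab oqush"]
    print(rank("kitab", docs))
```

Because the index terms are character n-grams, no stemmer is involved: an inflected form that shares most of its n-grams with the query still contributes to the score, which is the motivation the abstract gives for avoiding stem segmentation.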

Keywords/Search Tags:

Uyghur, information retrieval, n-gram language model, smoothing, Lucene

Related items

1	Research Of Uyghur N-gram Model And Smoothing Algorithm
2	Research And System Implementation Of Uyghur Text Classification Based On N-gram
3	Research On Dependency Language Model For Information Retrieval
4	Using Statistical Language Modeling For Ad Hoc Information Retrieval
5	N-gram Language Model Based On Distributed System
6	Research And Improvement On Naïve Bayes Test Classifier
7	Research On Information Retrieval Models Based On Statistical Language Model And Passage Feature
8	Combining Vector Space Model And Language Model To Information Retrieval
9	Information Retrieval Of Uyghur Language
10	Ontology Based Cross Language And Full Text Information Retrieval