Research On Uyghur Full-text Information Retrieval Based On N-gram Characters Model

Posted on: 2017-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: L R Xu
GTID: 2348330503984340
Subject: Computer application technology
Abstract/Summary:
In conventional Uyghur information retrieval, the flexible and varied word formation and word shapes of Uyghur text mean that the system has to segment words into stems and index each stem as an index entry. However, because of the shortcomings and limitations of the segmentation tool itself, some stems are not recognized or are recognized incorrectly, which indirectly reduces the retrieval effectiveness of the system. To address this problem, this thesis follows the morphological features of the Uyghur language: the text is divided into character n-grams of a suitable length to build the index, and an n-gram language model is then built on that index. While building the language model, an appropriate smoothing algorithm is selected to optimize the document language model and the corpus language model, which compensates for the data sparsity of a single document model. To make the retrieval results more accurate, several models are used in the scoring process. Finally, a full-text retrieval system based on the Uyghur character n-gram model is implemented with the open-source Lucene search library, and a Uyghur news corpus collected with a Python crawler is used for retrieval tests. The test results show that the mixed unigram model over character n-grams of length 3 and length 4, with Dirichlet smoothing and a parameter of 2000, gives the best search results, and that this method performs better than traditional retrieval.
Keywords/Search Tags: Uyghur, information retrieval, n-gram language model, smoothing, Lucene
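
The scoring idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, which is built on Lucene. It only shows character n-grams of length 3 and 4 treated as unigram terms and scored by query likelihood with Dirichlet smoothing, p(g|d) = (c(g,d) + mu * p(g|C)) / (|d| + mu), with mu = 2000. The function names, the choice to cut n-grams inside whitespace-delimited words, the skipping of n-grams unseen in the corpus, and the romanized toy strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, sizes=(3, 4)):
    """Return overlapping character n-grams for every length in `sizes`.

    Assumption: n-grams are cut inside each whitespace-delimited word;
    words shorter than n contribute no n-gram of that length.
    """
    grams = []
    for word in text.split():
        for n in sizes:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def collection_stats(docs, sizes=(3, 4)):
    """Count every n-gram over the whole corpus (the corpus language model)."""
    counts = Counter()
    for doc in docs:
        counts.update(char_ngrams(doc, sizes))
    return counts, sum(counts.values())


def dirichlet_score(query, doc, coll_counts, coll_len, mu=2000.0, sizes=(3, 4)):
    """Log query likelihood of `query` under the Dirichlet-smoothed model of `doc`."""
    doc_counts = Counter(char_ngrams(doc, sizes))
    doc_len = sum(doc_counts.values())
    score = 0.0
    for gram in char_ngrams(query, sizes):
        p_coll = coll_counts.get(gram, 0) / coll_len
        if p_coll == 0.0:
            # Assumption: n-grams never seen in the corpus are ignored.
            continue
        p_doc = (doc_counts.get(gram, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p_doc)
    return score


def rank(query, docs, mu=2000.0, sizes=(3, 4)):
    """Return document indices sorted from best to worst score."""
    coll_counts, coll_len = collection_stats(docs, sizes)
    scores = [dirichlet_score(query, d, coll_counts, coll_len, mu, sizes) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

For example, `rank("kitab", ["bazar bardi", "kitab oqudi"])` places the second toy document first, because it contains all of the query's 3-grams and 4-grams. Indexing n-grams instead of stems means no segmentation tool is needed, which is the motivation given in the abstract.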