Statistical Language Models Based Chinese Text Information Retrieval

Posted on: 2005-01-03; Degree: Master; Type: Thesis
Country: China; Candidate: Z Y Wang; Full Text: PDF
GTID: 2168360125468467; Subject: Information Science
Abstract/Summary:
With the rapid growth of text information resources, text information retrieval systems have become an indispensable tool for obtaining useful information. The retrieval model is the mathematical foundation of text information retrieval technology, which makes it one of the main research directions in the field.

As a tool for natural language processing, statistical language models have proven able to handle large-scale real-world text. IR models formed by combining statistical language models with information retrieval represent significant progress in retrieval model research.

This thesis starts from the basic principles of text retrieval models, analyzes the strengths and weaknesses of several traditional IR models, and presents the basic principles, key techniques, and advantages of IR models based on statistical language models. Using the standard TREC Chinese test collections, it then addresses two questions:

(1) How well does the Chinese statistical language model IR model perform? That is, does combining Chinese statistical language models with IR have a future?
(2) How does feature selection affect Chinese statistical language models, and which feature selection method is better?

For question (1), we compare the SLM-IR model with two traditional IR models, the vector space model and the probabilistic model, and report the performance of all three using standard TREC evaluation methods. The experimental results show that the simple SLM-CIR model outperforms the simple vector space and probabilistic models.

For question (2), we select several typical feature selection methods, namely single Chinese characters, word segmentation, and bigrams, and compare their performance. In addition, given the particular nature of word segmentation, we also test several different word segmentation methods and report the performance of the word-based SLM-IR model under each of them. The experimental results show the following:

① With single Chinese character segmentation, the simple SLM-CIR model outperforms the simple vector space and probabilistic models. With word segmentation and bigram segmentation, the simple SLM-CIR model outperforms the vector space model; although it scores slightly below the OKAPI probabilistic model, the SLM-CIR model with feedback clearly outperforms the OKAPI probabilistic model.
② For the simple SLM-CIR model, word segmentation does not outperform bigram or single Chinese character segmentation, and the choice of word segmentation method has no obvious effect on retrieval performance. This shows that word segmentation technology is not a key factor in the performance of the SLM-CIR model.
③ We confirm a conclusion previously drawn from English test collections: whichever segmentation method is used, Bayesian smoothing with a Dirichlet prior is better than the other two smoothing methods.

In future work, we can further investigate semantic smoothing techniques and use statistical language models as a powerful tool for constructing more complex IR models.
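
To make the compared techniques concrete, the following is a minimal sketch of query-likelihood ranking with Dirichlet prior smoothing over single-character and character-bigram features. It is not the thesis's implementation: the function names, the value of `mu`, and the choice to skip query terms that never occur in the collection are all assumptions made for illustration. The smoothed estimate used is the usual form p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu), where c(w,d) is the term count in the document and p(w|C) is the collection frequency.

```python
import math
from collections import Counter


def char_unigrams(text):
    """Treat every single non-space character as an index term."""
    return [ch for ch in text if not ch.isspace()]


def char_bigrams(text):
    """Treat every overlapping pair of adjacent characters as an index term."""
    chars = char_unigrams(text)
    return [a + b for a, b in zip(chars, chars[1:])]


def dirichlet_score(query_terms, doc_terms, coll_counts, coll_len, mu=1000.0):
    """Log query likelihood under a Dirichlet-smoothed document model."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = coll_counts.get(term, 0) / coll_len
        if p_coll == 0.0:
            # Assumption: terms unseen in the whole collection are skipped.
            continue
        p_doc = (doc_counts[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_doc)
    return score


def rank(query, docs, tokenize, mu=1000.0):
    """Return document indices sorted from best to worst match."""
    tokenized = [tokenize(d) for d in docs]
    coll_counts = Counter(t for doc in tokenized for t in doc)
    coll_len = sum(coll_counts.values())
    q_terms = tokenize(query)
    scores = [
        dirichlet_score(q_terms, d, coll_counts, coll_len, mu) for d in tokenized
    ]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Passing `char_unigrams` or `char_bigrams` as `tokenize` switches the feature set without touching the scoring code, which mirrors the kind of comparison described in the abstract; a word segmenter could be plugged in the same way.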
Keywords/Search Tags: statistical language models, Chinese information retrieval, smoothing techniques