Research on Chinese Web Information Retrieval Based on Statistical Language Models

Posted on:2013-01-17

Degree:Master

Type:Thesis

Country:China

Candidate:Z Li

GTID:2218330371991598

Subject:Information Science

Abstract/Summary:

With the rapid development of the Internet, the volume of information has grown exponentially. Ways of accessing information have become more diverse, but searching for it has become more complicated. There is an urgent need for advanced information processing technology that can handle this volume and quickly retrieve the information people need to support decision-making and research. The wide adoption of information processing technology owes much to advances in natural language processing. To address the retrieval problem effectively, research on document content, retrieval models, matching strategies and ranking algorithms has steadily increased. Retrieval models remain an active topic in information retrieval research, and a variety of models and methods have emerged, such as the Boolean model, the vector space model and the probabilistic model. In recent years the statistical language model, which combines natural language with statistics and rests on a solid mathematical foundation, has become a dominant retrieval model and has been studied extensively.

This thesis builds a simple information retrieval system on the large-scale Chinese web corpus CWT200G. It follows the standard information retrieval evaluation procedures of TREC and SEWM and combines the Lemur platform with a word segmentation component, ICTCLAS, the Chinese lexical analysis system from the Chinese Academy of Sciences. The thesis first describes its theoretical basis and the key issues in Chinese Web information retrieval based on statistical language models: statistical language models, data smoothing, Chinese word segmentation and Chinese text indexing. It then briefly introduces the Chinese web page corpus, the retrieval evaluation and the experimental platform, and analyzes in detail how the system processes the data.

Finally, experiments compare the topic retrieval performance of the traditional vector space model, the probabilistic model and the statistical language model on the Chinese web page corpus, and analyze the strengths and weaknesses of each. Topic retrieval experiments with the statistical language model are also run using the simplified Jelinek-Mercer, Dirichlet prior and absolute discounting smoothing methods, and the retrieval performance of the three methods is compared.
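
The abstract names three smoothing methods but does not give their formulas. The following is a minimal Python sketch of query-likelihood scoring with each method in its commonly cited textbook form. It is not the thesis's Lemur-based implementation. The function name `query_log_likelihood`, the parameter names `lam`, `mu` and `delta`, and their default values are assumptions chosen for illustration, and the input is assumed to be already segmented into words (for example by a tool such as ICTCLAS).

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, method="dirichlet",
                         lam=0.5, mu=2000.0, delta=0.7):
    """Return log P(query | doc) under a smoothed unigram language model.

    query, doc  -- lists of tokens (assumed to be already word-segmented)
    collection  -- Counter of term counts over the whole corpus
    method      -- "jm", "dirichlet" or "absolute"
    lam, mu, delta -- smoothing parameters (illustrative defaults only)
    """
    tf = Counter(doc)                  # term counts in the document
    doc_len = len(doc)
    unique_terms = len(tf)             # number of distinct terms in the document
    coll_len = sum(collection.values())
    score = 0.0
    for term in query:
        p_coll = collection[term] / coll_len   # background model P(w | C)
        if p_coll == 0.0:
            continue                   # term never seen in the corpus: ignore it
        if doc_len == 0:
            p = p_coll                 # empty document: fall back to the background
        elif method == "jm":
            # Jelinek-Mercer: linear interpolation with the background model
            p = (1.0 - lam) * tf[term] / doc_len + lam * p_coll
        elif method == "dirichlet":
            # Dirichlet prior: mu pseudo-counts drawn from the background model
            p = (tf[term] + mu * p_coll) / (doc_len + mu)
        elif method == "absolute":
            # Absolute discounting: subtract delta from each seen count and
            # give the freed probability mass to the background model
            p = (max(tf[term] - delta, 0.0)
                 + delta * unique_terms * p_coll) / doc_len
        else:
            raise ValueError(f"unknown smoothing method: {method}")
        score += math.log(p)
    return score


# Toy usage: two tiny "documents" and a two-word query.
doc_a = ["统计", "语言", "模型", "检索"]
doc_b = ["中文", "分词", "系统"]
corpus_counts = Counter(doc_a + doc_b)
query = ["语言", "模型"]
ranking = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: query_log_likelihood(query, item[1], corpus_counts, "jm"),
    reverse=True,
)
```

In all three methods a query word missing from a document still receives a non-zero probability from the corpus-wide model, which is what allows documents to be ranked when they do not contain every query word.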

Keywords/Search Tags:

Statistical language models, Chinese web information retrieval, Data smoothing techniques, Chinese word segmentation

Related items

1	Statistical Language Models Based Chinese Text Information Retrieval
2	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
3	Research And Implementation Of Chinese Word Segmentation Algorithm
4	Research On Chinese Word Segmentation Integrating Pinyin And Tone Information
5	Research And Implementation Of Chinese Auto-segmentation System
6	Research On Chinese Word Segmentation For Large Scale Information Retrieval
7	Research On Chinese Word Segmentation Strategies For Statistical Machine Translation
8	Research And Implementation Of Chinese Word Segmentation System For Enterprise Information Retrieval
9	Chinese Word Semantic Similarity Measure And Its Application In Cross-language Information Retrieval
10	The Research And Implementation Of The Chinese Word Segmentation System Combining Omni-Segmentation With Statistic