
Research on Chinese Web Information Retrieval Based on Statistical Language Models

Posted on: 2013-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: Z Li
Full Text: PDF
GTID: 2218330371991598
Subject: Information Science
Abstract/Summary:
With the rapid development of the Internet, the volume of information has grown exponentially. The ways of accessing information have become increasingly diverse, yet searching for information has become more complicated. High-level information processing technology is urgently needed to handle this vast amount of information and to retrieve the necessary information quickly, helping people make better decisions and carry out research. The popularity and wide application of information processing technology owe much to the development of natural language processing. To solve the information retrieval problem effectively, research on document content, retrieval models, matching strategies and ranking algorithms has been steadily increasing. The retrieval model remains a hot topic in information retrieval research, and a variety of retrieval models and methods have emerged, such as the Boolean model, the vector space model and the probabilistic model. In recent years in particular, the statistical language model has been proposed. It combines natural language with statistics and has a strong mathematical foundation, and it has become dominant among information retrieval models and the subject of much research. This thesis builds a simple information retrieval system on the large-scale Chinese web corpus CWT200G, following the standard information retrieval procedures of TREC and SEWM and combining the Lemur platform with a word segmentation component, ICTCLAS, the Chinese lexical analysis system developed by the Chinese Academy of Sciences. First, the thesis presents its theoretical basis and describes the key issues that must be studied for Chinese web information retrieval based on statistical language models: the statistical language model, data smoothing, Chinese word segmentation and Chinese text indexing.
Next, the thesis briefly introduces the Chinese web page corpus used for information retrieval evaluation and the required experimental platform, and analyzes in detail how the system processes the data. Finally, through experiments and data analysis, it compares the strengths and weaknesses of the traditional vector space model, the probabilistic model and the statistical language model in topic retrieval performance on the Chinese web page corpus. In addition, topic retrieval experiments with the statistical language model are run using the simplified Jelinek-Mercer smoothing method, the Dirichlet prior smoothing method and the absolute discounting smoothing method, and the performance of the three smoothing methods in information retrieval is compared.
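The three smoothing methods named in the abstract can be illustrated with a minimal query-likelihood sketch in Python. It is not the thesis's Lemur-based implementation: the function names, the toy documents and the parameter values (lam, mu, delta) are assumptions chosen for illustration, and the documents are assumed to have already been segmented into words (for example by a tool such as ICTCLAS).

```python
import math
from collections import Counter

# Each function returns a smoothed P(term | document).
# tf: term count in the document; doc_len: document length in tokens;
# p_coll: P(term | collection); unique_terms: distinct terms in the document.
# All functions share one signature so they are interchangeable below.

def jelinek_mercer(tf, doc_len, p_coll, unique_terms, lam=0.5):
    # Linear interpolation between document and collection models.
    return (1 - lam) * (tf / doc_len) + lam * p_coll

def dirichlet_prior(tf, doc_len, p_coll, unique_terms, mu=2000.0):
    # Collection model acts as a prior worth mu pseudo-counts.
    return (tf + mu * p_coll) / (doc_len + mu)

def absolute_discounting(tf, doc_len, p_coll, unique_terms, delta=0.7):
    # Subtract delta from each seen count and give the freed mass
    # (delta * unique_terms / doc_len) to the collection model.
    return (max(tf - delta, 0.0) + delta * unique_terms * p_coll) / doc_len

def query_log_likelihood(query, doc, coll_counts, coll_len, smooth):
    """Score a document by log P(query | document) under a smoothing function."""
    counts = Counter(doc)
    score = 0.0
    for term in query:
        if coll_counts[term] == 0:
            continue  # term unseen in the whole collection: skip it (an assumption)
        p_coll = coll_counts[term] / coll_len
        score += math.log(smooth(counts[term], len(doc), p_coll, len(counts)))
    return score

# Toy, already-segmented documents (hypothetical data).
docs = {
    "d1": ["信息", "检索", "模型"],
    "d2": ["语言", "模型", "平滑", "模型"],
}
coll_counts = Counter(t for d in docs.values() for t in d)
coll_len = sum(coll_counts.values())
query = ["语言", "模型"]

for smooth in (jelinek_mercer, dirichlet_prior, absolute_discounting):
    ranking = sorted(
        docs,
        key=lambda name: query_log_likelihood(
            query, docs[name], coll_counts, coll_len, smooth
        ),
        reverse=True,
    )
    print(smooth.__name__, ranking)
```

All three methods reserve some probability mass for terms that do not occur in a document, so a document is not scored as impossible just because it lacks one query term. They differ in how that mass is taken from the seen terms: by a fixed interpolation weight, by an amount that depends on document length, or by a constant discount per seen term.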
Keywords/Search Tags: Statistical language models, Chinese web information retrieval, Data smoothing techniques, Chinese word segmentation