Research and Implementation of a Chinese Full-Text Information Retrieval Technology Based on the Lucene Search Engine

Posted on: 2011-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z R Li
Full Text: PDF
GTID: 2178360302964262
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of network information resources, increasing attention has been paid to how potentially valuable information can be extracted quickly and efficiently from massive amounts of network information so that it can be applied effectively in management and decision-making. Information retrieval technology helps users extract the information they need from a mass of information, saving their time and increasing their productivity. The mechanisms and principles of information retrieval for Chinese and for Western languages are basically the same, but because of the characteristics of the Chinese language itself, some Chinese language processing technologies must be introduced, among which Chinese word segmentation is a crucial part.

First, this thesis elaborates the key technologies related to Chinese full-text information retrieval, including the concepts of information retrieval, Chinese word segmentation algorithms, and document relevance sorting algorithms. It systematically compares and analyzes four main kinds of Chinese segmentation algorithm: segmentation based on string matching, segmentation based on understanding, segmentation based on statistics, and segmentation based on semantics. Their respective advantages and disadvantages when applied to Chinese word segmentation are summarized. Building on Lucene's original document relevance sorting algorithm, the thesis proposes an improved sorting algorithm that applies PageRank to a secondary search based on user behavior and adds extra points for home pages.

The main task of the thesis is the design and implementation of a prototype Chinese full-text information retrieval system based on the Lucene search engine. It proposes several improvements to the algorithms and the system: index preprocessing, optimization of the keyword prompt operation, the introduction of a stop-word segmentation algorithm, and improvements to the maximum matching algorithm and the reverse maximum matching algorithm. Experiments compare the improved dictionary-based segmentation method with Lucene's built-in segmentation methods, namely unigram (single-character) segmentation and bigram (two-character) segmentation, and verify the superiority of the improved dictionary-based segmentation algorithm proposed in the thesis. Users' subjective evaluation of the documents returned when PageRank is applied to the secondary search based on user behavior and extra points are added for home pages indicates that the improved document relevance sorting algorithm significantly enhances the accuracy of the search system.

Finally, the thesis summarizes the design approach and implementation steps for the Chinese full-text information retrieval system based on the Lucene search engine, as well as directions for further research and improvement.
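
For readers unfamiliar with the segmentation methods named in the abstract, the following is a minimal sketch of textbook forward and reverse maximum matching, a stop-word filter, and a bigram baseline. It is not the improved algorithm from the thesis, whose details the abstract does not give. The function names, the `max_len` parameter, the sample dictionary, and the stop-word set are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: textbook maximum matching, NOT the thesis's
# improved variant. All names and sample data here are hypothetical.

def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or word in dictionary:
                tokens.append(word)
                i += size
                break
    return tokens


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            word = text[j - size:j]
            if size == 1 or word in dictionary:
                tokens.append(word)
                j -= size
                break
    tokens.reverse()  # Words were collected from the end of the text.
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def bigram_segment(text):
    """Overlapping two-character segmentation, used here as a baseline."""
    if len(text) < 2:
        return list(text)
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    sample_dictionary = {"研究", "研究生", "生命", "起源"}
    sample_stop_words = {"的"}
    sentence = "研究生命的起源"
    print(forward_max_match(sentence, sample_dictionary))
    print(reverse_max_match(sentence, sample_dictionary))
    print(remove_stop_words(
        reverse_max_match(sentence, sample_dictionary), sample_stop_words))
    print(bigram_segment(sentence))
```

On the sample sentence the two scan directions disagree: the forward scan commits to the longer word 研究生 first, while the reverse scan recovers 研究 and 生命. This kind of ambiguity is one reason dictionary-based methods are often run in both directions and compared.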
Keywords/Search Tags: Lucene search engine, Chinese word segmentation, document relevance sort, full-text information retrieval