Research And Implementation Of A Chinese Full-Text Information Retrieval Technology Based-on Lucene Search Engine

Posted on:2011-11-16

Degree:Master

Type:Thesis

Country:China

Candidate:Z R Li

GTID:2178360302964262

Subject:Computer application technology

Abstract/Summary:

With the rapid growth of network information resources, increasing attention is being paid to how potentially valuable information can be extracted quickly and efficiently from massive volumes of online data and applied effectively to management and decision-making. Information retrieval technology helps users extract the information they need from a mass of data, saving their time and increasing their productivity. The mechanisms and principles of information retrieval are basically the same for Chinese and for Western languages, but the characteristics of the Chinese language require some additional language processing technologies, of which Chinese word segmentation is a crucial part.

First, the thesis describes the key technologies related to Chinese full-text information retrieval, including the concepts of information retrieval, Chinese word segmentation algorithms, and document relevance sorting (ranking) algorithms. It systematically compares and analyzes four main kinds of Chinese segmentation algorithm: those based on string matching, on understanding, on statistics, and on semantics. Their respective advantages and disadvantages for Chinese word segmentation are summarized. Building on Lucene's original document relevance sorting algorithm, the thesis proposes an improved sorting algorithm that uses PageRank in a secondary search based on user behavior and adds a bonus score for home pages.

The main task of the thesis is the design and implementation of a prototype Chinese full-text information retrieval system based on the Lucene search engine.
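The abstract does not give the formula behind the improved sorting algorithm, so the following is only a minimal sketch of the general idea: a text relevance score is combined with a PageRank value, and home pages receive an extra bonus. The function names, the weights `link_weight` and `home_bonus`, and the way the scores are combined are all assumptions made for illustration, not the thesis's actual method.

```python
from urllib.parse import urlparse


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def is_home_page(url):
    """Treat a URL with an empty or root path as a site home page."""
    return urlparse(url).path in ("", "/")


def rerank(hits, rank, link_weight=0.5, home_bonus=0.1):
    """Re-sort (url, text_score) hits. The weights are hypothetical."""
    top = max(rank.values())

    def combined(hit):
        url, text_score = hit
        # Boost the text score by the page's PageRank relative to the best page.
        score = text_score * (1.0 + link_weight * rank.get(url, 0.0) / top)
        if is_home_page(url):
            score += home_bonus  # assumed form of the "extra point" for home pages
        return score

    return sorted(hits, key=combined, reverse=True)
```

In this sketch, `hits` would be the (URL, score) pairs returned by the first-pass text search, and `rerank` plays the role of the secondary pass. How user behavior feeds into that pass is not specified in the abstract and is not modeled here.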
The thesis proposes several improvements to the algorithms and the system: index preprocessing, optimization of the keyword prompt (suggestion) operation, the introduction of stop words into the segmentation algorithm, and improvements to the maximum matching and reverse maximum matching algorithms. Experiments compare the improved dictionary-based segmentation method with Lucene's built-in segmentation methods, namely single-character (unigram) and two-character (bigram) segmentation, and verify the superiority of the improved dictionary-based algorithm. In users' subjective evaluation of the returned documents, the improved relevance sorting algorithm, which uses PageRank in a secondary search based on user behavior and adds a bonus score for home pages, significantly improved the accuracy of the search system.

Finally, the thesis summarizes the design approach and implementation steps of the Chinese full-text information retrieval system based on the Lucene search engine, and outlines directions for further research and improvement.
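For readers unfamiliar with the segmentation methods named in the abstract, the sketch below shows the textbook forms of forward and reverse maximum matching, stop-word filtering, and character n-gram segmentation. It is not the improved algorithm from the thesis, whose details the abstract does not give. The function names, the sample dictionary, and the `max_len` limit are assumptions chosen for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each step."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word at each step."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def remove_stop_words(words, stop_words):
    """Drop words that carry little meaning for retrieval."""
    return [w for w in words if w not in stop_words]


def char_ngrams(text, n):
    """Overlapping character n-grams: n=1 is unigram, n=2 is bigram segmentation."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical sample data, for illustration only.
sample_dictionary = {"研究", "研究生", "生命", "起源"}
sample_stop_words = {"的"}
sample_text = "研究生命的起源"

forward = forward_max_match(sample_text, sample_dictionary)
backward = reverse_max_match(sample_text, sample_dictionary)
print(forward)   # ['研究生', '命', '的', '起源']
print(backward)  # ['研究', '生命', '的', '起源']
print(remove_stop_words(backward, sample_stop_words))  # ['研究', '生命', '起源']
```

The sample sentence shows why scan direction matters: the forward scan greedily takes 研究生 and leaves 命 stranded, while the reverse scan recovers 研究 and 生命. N-gram segmentation needs no dictionary, but it indexes many fragments that are not words, such as 究生 in the bigrams of the same sentence.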

Keywords/Search Tags:

Lucene search engine, Chinese word segmentation, document relevance sort, full-text information retrieval

Related items

1	Research On Full-text Information Retrieval Technology For WeChat Content
2	Research And Application Of Full-text Retrieval Technology Based On Lucene
3	Lucene Chinese Word Segmentation Applied Research, Research Document Full-text Retrieval System
4	The Research And Implementation Of Full-Text Search Engine Based On Lucene
5	The Research And Implementation Of Enterprise Search Engine Based On Lucene
6	Research And Design Of Search Within Application System Based On Lucene
7	Application Study Of Lucene Full-text Retrieval On The Network Education Platform
8	"Luder" Content Based Document Search Engine
9	Design And Improvement Of Website Full-text Retrieval System Based On Lucene
10	Research And Application Of Lucene Full-text Retrieval Technology In Patent Information Service Platform