Chinese Natural Language Search Engine Based On Lucene

Posted on:2010-07-23

Degree:Master

Type:Thesis

Country:China

Candidate:C C Hu

Full Text:PDF

GTID:2178360275970361

Subject:Communication and Information System

Abstract/Summary:

With the rapid development of Internet technology and of information sharing and publishing, humanity has entered an unprecedented era of "information explosion". The growth of Internet information provides a vast body of resources, but it also makes the information we need harder to find. Without a powerful tool to help us find and discover useful information, we would be lost in an ocean of information. The search engine is a technology that addresses this problem: it collects and discovers information, interprets, extracts, processes, and organizes it, and provides search services to users. It is the best tool for bringing the Internet to users and serves as a navigation aid. However, search engine technology draws on database management, information retrieval, artificial intelligence, natural language processing, machine learning, and many other disciplines, and commercial companies are unwilling to share their search technology with the public, which hinders the development of search engine applications. With the open-source tool Lucene, developers can nonetheless build a powerful search engine simply, quickly, and in a targeted manner.

First, the word segmentation algorithms of most Chinese analyzers for Lucene do not segment text in the way Chinese readers conventionally do. To overcome this deficiency, this thesis proposes a new Chinese analyzer based on the maximum matching algorithm and a standard dictionary. The experimental results show that the proposed analyzer segments text in line with conventional Chinese usage, and that its indexing performance is very close to that of analyzers based on mechanical segmentation. In addition, retrieval efficiency improves by a factor of 2 to 4, and the retrieval response rate improves by 59%.

Second, this thesis proposes a natural language query interface to meet users' needs. When a user submits a query sentence, the system segments it into words, removes the auxiliary words, extracts the core words of the query, and then searches on those words. To improve segmentation accuracy, the thesis combines two word segmentation algorithms and uses probabilities to resolve ambiguities.

In addition, this thesis studies page relevance, the PageRank algorithm, and the Lucene scoring system, and incorporates PageRank into Lucene scoring so that the system can return more important pages to the user. It also proposes using the simhash algorithm to filter out similar pages, and it improves the quicksort algorithm to make it stable and faster.

Finally, this thesis implements a prototype Chinese natural language search engine. The prototype integrates the network resources of Shanghai Jiao Tong University. Experiments show that the prototype performs well and is practical, and that it provides a good platform for further research.
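
The dictionary-based maximum matching idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which matching direction, dictionary, or fallback rule the thesis uses, so the forward direction, the toy dictionary, the single-character fallback, and the function name `forward_max_match` are all assumptions made for illustration only.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment `text` greedily, always taking the longest dictionary word.

    Illustrative sketch only: forward direction and single-character
    fallback are assumptions, not details taken from the thesis.
    """
    if max_word_len is None:
        # Longest dictionary entry bounds the window we need to try.
        max_word_len = max((len(w) for w in dictionary), default=1)
    max_word_len = max(1, max_word_len)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan advances.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, not the standard dictionary used in the thesis.
toy_dictionary = {"上海", "交通", "大学", "上海交通大学", "搜索", "引擎", "搜索引擎"}
segments = forward_max_match("上海交通大学搜索引擎", toy_dictionary)
```

With this toy dictionary the longest entries win, so the text is split into two long words rather than five two-character words. A purely mechanical approach such as per-character or bigram splitting would not use the dictionary at all.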

Keywords/Search Tags:

Lucene, search engine, index, retrieval, word segmentation

Related items

1	The Research And Implementation Of Full-Text Search Engine Based On Lucene
2	The Design And Implementation Of Search Engine Based On Lucene
3	Design Of Real Estate Marketing System Based On Lucene Technology
4	The Research And Design On Vertical Search Engine Based On Lucene
5	Research And Implementation Of A Chinese Full-Text Information Retrieval Technology Based-on Lucene Search Engine
6	The Research And Implementation Of Enterprise Search Engine Based On Lucene
7	Research On Key Technology Of Vertical Search Engine
8	Enterprise Search Engine Based On Lucene
9	Research And Implementation Of Subject-oriented Mobile Search Engine Based On Lucene
10	The Design And Implementation Of Knowledge Search Engine In Technology Base On Lucene