
Chinese Natural Language Search Engine Based On Lucene

Posted on: 2010-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: C C Hu
GTID: 2178360275970361
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of Internet technology and of information sharing and publishing, humanity has entered an unprecedented era of "information explosion". The growth of information on the Internet provides us with massive information resources, but it also makes information harder to find. Without a powerful tool to help us find and discover useful information, we would be lost in an ocean of information. The search engine is a technology that addresses this problem: it collects and discovers information, interprets, extracts, processes, and organizes it, and provides users with search services. It is the best tool for bringing the Internet to us and serves as a navigator. However, search engine technology draws on database management, information retrieval, artificial intelligence, natural language processing, machine learning, and many other disciplines, and commercial companies are unwilling to share their search technology with the public, which hinders the development of search engine applications. With the open source tool Lucene, however, developers can build a powerful search engine simply, quickly, and in a targeted manner.

First, the word segmentation algorithms of most Chinese analyzers for the Lucene search engine do not match Chinese language habits. To overcome this deficiency, this thesis proposes a new Chinese analyzer based on the maximal match algorithm and a standard dictionary. Experimental results show that the segmentation produced by the proposed analyzer matches Chinese language habits, and that its indexing performance is very close to that of analyzers based on mechanical segmentation. In addition, retrieval efficiency improves by a factor of 2 to 4, and the retrieval response rate improves by 59%.

Second, this thesis proposes a natural language query interface to meet users' needs. When a user submits a query sentence, the system segments it into words, removes the auxiliary words, extracts the core words of the query, and then searches for those words. To improve segmentation accuracy, this thesis combines two word segmentation algorithms and uses probability to resolve ambiguities.

In addition, this thesis studies page relevance, the PageRank algorithm, and the Lucene scoring system, and incorporates the PageRank algorithm into the Lucene scoring system so that the system can return more important pages to the user. It also proposes using the simhash algorithm to filter out similar pages, and it improves the quick sort algorithm to make it stable and faster.

Finally, this thesis implements a prototype Chinese natural language search engine. The prototype integrates the network resources of Shanghai Jiaotong University. Experiments show that the prototype has good performance and practicality and provides a good platform for further research.
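The query pipeline described in the abstract (segment the sentence, drop auxiliary words, keep the core words) can be illustrated with a minimal sketch. It is not the thesis implementation: the thesis builds a Lucene analyzer and combines two segmentation algorithms with probabilistic disambiguation, whereas this sketch assumes plain forward maximum matching only. The function names, the tiny dictionary, and the auxiliary-word list are all hypothetical and chosen for illustration.

```python
def forward_max_match(text, dictionary):
    """Segment text greedily, always taking the longest dictionary word.

    Assumption: plain forward maximum matching; characters that start no
    dictionary word are emitted as single-character tokens.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until it matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def core_words(query, dictionary, auxiliary_words):
    """Segment the query, then drop auxiliary words to keep the core terms."""
    return [t for t in forward_max_match(query, dictionary)
            if t not in auxiliary_words]


# Hypothetical toy data, not the standard dictionary used in the thesis.
DICTIONARY = {"上海", "交通", "大学", "上海交通大学", "图书馆"}
AUXILIARY_WORDS = {"的", "了", "吗", "是"}

if __name__ == "__main__":
    print(core_words("上海交通大学的图书馆", DICTIONARY, AUXILIARY_WORDS))
```

Because the longest candidate is tried first, "上海交通大学" is kept as one term instead of being split into "上海", "交通", and "大学", which is the kind of segmentation that matches Chinese usage better than character-by-character mechanical splitting.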
Keywords/Search Tags: Lucene, search engine index, retrieval, word segmentation