Research and Implementation of a Commercial Application Based on the Lucene Search Engine

Posted on: 2008-10-04
Degree: Master
Type: Thesis
Country: China
Candidate: T L Pan
Full Text: PDF
GTID: 2208360215450063
Subject: Computer software and theory
Abstract/Summary:
Nowadays, the amount of information on the Internet can be overwhelming. At the same time, the need to locate information quickly in a sea of data is not limited to the Internet: desktop computers store ever more data as well. With this abundance of information, and with time being one of the most precious commodities for most people, we need to make flexible, free-form, instant queries that cut across rigid category boundaries and find exactly what we are after with the least possible effort. Search engines with varying capabilities have emerged to address this problem, and search engine technology has become a hotspot of research and development in both the computer industry and academia.

A search engine is an application software system that collects information according to a certain strategy, organizes and processes it, and then provides an information query service to users.

This thesis first explores the background and history of search engines and briefly introduces four well-known Chinese search engines, pointing out the current state and trends of search engine development. It then explains the basic theory, related technologies, and workflow of a search engine. Investigating the core technologies of Chinese search engines gives a better understanding of the difficulties in building one. Written Chinese has no delimiter between words (unlike the spaces in written English), so word segmentation, which means breaking a sentence into words, is an essential task in Chinese language processing. The thesis analyzes and compares existing Chinese word segmentation methods and obtains an effective method by improving the shortest-path algorithm.

Lucene, a mature, free, open-source project implemented in Java, is then introduced.
Lucene is a high-performance, scalable Information Retrieval (IR) library. Analyzing its source code and writing experimental programs reveals the essentials of Lucene. Thanks to its simple yet powerful core API, Lucene can be integrated into an application quickly.

Finally, this thesis presents an implementation of a search engine in the Best-tone system owned by China Telecom. The main modules of the application and their functions are explained in detail. The implementation shows that Lucene can be customized to follow specific business rules by implementing the Chinese word segmentation process and adopting a suitable weight-based sorting algorithm.
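
The abstract does not describe how the shortest-path algorithm was improved, so the following is only a minimal sketch of the basic idea it builds on, under stated assumptions. Character positions are treated as graph nodes, every dictionary word and every single character is an edge, and the segmentation with the fewest tokens is the shortest path from the start of the text to its end. The class name `ShortestPathSegmenter`, the method `segment`, and the plain `Set<String>` dictionary are hypothetical names chosen for illustration; they are not taken from the thesis or from Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Illustrative fewest-tokens segmenter; a hypothetical class, not the thesis's code. */
final class ShortestPathSegmenter {

    /**
     * Splits text into the fewest possible tokens, where a token is either a
     * dictionary word or a single character (the fallback for unknown text).
     * Indexing is by UTF-16 char, which is adequate for common Chinese characters.
     */
    static List<String> segment(String text, Set<String> dictionary) {
        // Longest dictionary word bounds how far back an edge can start.
        int maxWordLength = 1;
        for (String word : dictionary) {
            maxWordLength = Math.max(maxWordLength, word.length());
        }

        int n = text.length();
        int[] cost = new int[n + 1]; // cost[i] = fewest tokens covering text[0, i)
        int[] prev = new int[n + 1]; // prev[i] = start index of the last token on that path
        Arrays.fill(cost, Integer.MAX_VALUE);
        cost[0] = 0;

        // Positions form a DAG, so one left-to-right pass finds all shortest paths.
        for (int end = 1; end <= n; end++) {
            int earliestStart = Math.max(0, end - maxWordLength);
            for (int start = end - 1; start >= earliestStart; start--) {
                boolean isEdge = (end - start == 1)
                        || dictionary.contains(text.substring(start, end));
                // Strict comparison: among equally short paths the first one found is kept.
                if (isEdge && cost[start] + 1 < cost[end]) {
                    cost[end] = cost[start] + 1;
                    prev[end] = start;
                }
            }
        }

        // Walk the prev links back from the end, then reverse into reading order.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            tokens.add(text.substring(prev[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }
}
```

For example, with a dictionary containing 中国, 电信, 中国电信, 号码, 百事通 and 号码百事通, the input 中国电信号码百事通 is split into the two tokens 中国电信 and 号码百事通 rather than four shorter words. When several segmentations have the same number of tokens, this sketch simply keeps the first one it finds; resolving such ties well is presumably where a refined method has to do better, for instance by weighting edges with word frequencies.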
Keywords/Search Tags: Lucene, Search Engine, Information Retrieval, Business Application