Research And Application Of Full-text Search Based On Lucene

Posted on: 2010-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: G H Lu
GTID: 2178360272997031
Subject: Software engineering
Abstract/Summary:
In 2004, China Telecom began its transformation into an integrated information service provider. In the second half of 2005, it launched a new service, Best Tone, a voice search engine similar to Google that was nicknamed "Voice Google". Best Tone is a series of value-added information services that China Telecom built on its telephone number information service, offering customers convenient services covering "clothing, food, housing, travel and other consumption". By dialing the service access number 114/118114, a customer can obtain information such as food and beverage availability, tourist advice, road directions, hotel reservations and daily consumer goods.

The Best Tone system includes a number of subsystems, such as basic information and corporate switchboard. In this paper, we focus on the functions and implementation of the QC (Quality Control) subsystem, which stores historical call data and provides a query interface that inspection personnel use to conduct random checks on these data.

To meet the requirements of the provincial company, the subsystem must make the last three months of historical call data searchable; this historical data amounts to as many as 10 million entries. The old QC subsystem used a database as its data storage/access layer, but we found that this approach cannot meet the requirements of the application:

1. Low performance: it cannot search more than one week of historical data at one time.
2. Side effects: because it shares the same database, it may have a negative effect on other business applications.
3. Lack of scalability: it is difficult to meet the needs of future business development.

We therefore study these issues thoroughly and propose a solution: a new Lucene-based full-text search QC subsystem.

Full-text search means that the computer scans each article word by word and builds an index entry for each word, recording how often and where the word appears in the article; when a query arrives, the search program looks up the index and returns the results to the user. The process is similar to looking up a word in a dictionary through its index.

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. The Lucene project lets us index the documents on our file system or web server so that we can run combined full-text and metadata searches. A full-text search takes one or more words of a human language as a query and should return the documents that are "most relevant" for those words. The primary goal of Lucene is to provide a fast index and query implementation. Lucene is not, by itself, designed to be a complete user-facing index solution, but rather to provide the heart of such a system.

In this paper, we first analyze the principles of Lucene, including its system architecture, source code structure and the organization of the inverted index files, and then explain the basic principles and process of indexing a target data source and searching the index files. We also give a clear and concise introduction to several useful classes and to how their usage can be optimized. On this basis, we summarize the steps and methods of Lucene-based development.

Analysis of the original QC subsystem shows that, because it used traditional database technology as its data storage/access layer, the database could no longer meet the requirements and had become the bottleneck of system performance.
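
The full-text search principle described above (an index entry per word that records how often and where the word occurs) can be illustrated with a toy inverted index. This is only a minimal sketch in plain Java, not Lucene's actual index format or API: the class name `ToyInvertedIndex`, its methods and the naive tokenizer (lower-casing and splitting on non-alphanumeric ASCII characters) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index (hypothetical): term -> (document id -> positions of the term). */
final class ToyInvertedIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Tokenizes the text naively and records the position of every term occurrence. */
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // a leading separator produces an empty first token
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** Positions per matching document; the size of each list is the term frequency. */
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "hotel reservation and hotel advice");
        index.add(2, "road guide and tourist advice");
        System.out.println(index.lookup("hotel"));  // {1=[0, 3]}
        System.out.println(index.lookup("advice")); // {1=[4], 2=[4]}
    }
}
```

A query only touches the entries for its own terms instead of scanning every document, which is the property that makes the dictionary analogy apt.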

Our solution is therefore to introduce a new data persistence layer technology that replaces the database and breaks through the performance bottleneck. Our analysis of reliability, scalability, maintainability and other aspects demonstrates that a Lucene-based reconstruction is feasible. We give a new design covering the framework, the index structure, the query and maintenance logic, and fault tolerance, as follows:

1. Framework design. To preserve the original framework and data flow as far as possible and leave the other modules unchanged, the new QC subsystem provides its services as an independent module and uses the master-slave pattern to ensure service availability. In addition, it indexes directly from the request messages, which reduces the dependence and load on the database.
2. Index structure design. We first design the document model that corresponds to the business model and choose the appropriate indexing properties for each field according to business requirements. To avoid maintaining and searching one large index file, we store the index by day: the three months of data are divided into 31 day-based index files, which improves the maintainability and stability of the system. Finally, in line with the performance characteristics of Lucene, we index and search the current day's data in memory to improve system performance.
3. Query logic design. Because the index files are divided into 31 days, we can search them with the parallel multi-search technique: determine the serial numbers of the target index files from the query request, obtain the corresponding searchers, build a parallel multi-searcher to perform the retrieval, and finally return the results to the user.
4. Maintenance logic design. According to the different business needs, maintenance is divided into off-line rebuilding, real-time incremental indexing in memory, real-time modification and deletion, and periodic optimization. Because of the characteristics of inspection work, modifications and deletions on these index files do not involve cross-updates.
5. Fault-tolerance design. We put forward three main strategies: automatically restoring the memory index; caching requests while the peer is recovering and then synchronizing them to the peer; and manually rebuilding the index files. The first two guarantee the correctness of the data; the last ensures the maintainability of the system.

Finally, on the basis of the above design, we give the implementation, including the common module, searching module, indexing module and index reconstruction module. In addition, this paper presents some improvements to Lucene's Boolean query and wildcard query to improve system performance.

The system was delivered on March 31, 2009. Practice has shown that the new system achieves its design goals with good results.
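
The query step that finds "the serial number of target index files according to the query request" can be sketched as follows. The abstract does not say how a call date maps to one of the 31 index files, so this sketch assumes the serial number is simply the day of the month (1 to 31); the class `DailyIndexSelector` and its methods are hypothetical names, and no Lucene classes are used.

```java
import java.time.LocalDate;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper that picks which of the 31 daily index files a query must search. */
final class DailyIndexSelector {
    static final int SLOT_COUNT = 31;

    /** Assumed mapping (not stated in the abstract): serial number = day of month. */
    static int slotFor(LocalDate callDate) {
        return callDate.getDayOfMonth();
    }

    /** Serial numbers of the index files covering the inclusive date range [from, to]. */
    static SortedSet<Integer> slotsFor(LocalDate from, LocalDate to) {
        if (to.isBefore(from)) {
            throw new IllegalArgumentException("'to' must not be before 'from'");
        }
        SortedSet<Integer> slots = new TreeSet<>();
        // Stop early once every slot is selected; a longer range adds nothing new.
        for (LocalDate day = from;
             !day.isAfter(to) && slots.size() < SLOT_COUNT;
             day = day.plusDays(1)) {
            slots.add(slotFor(day));
        }
        return slots;
    }

    public static void main(String[] args) {
        // 27 Feb 2009 to 2 Mar 2009 touches the files for days 27, 28, 1 and 2.
        System.out.println(slotsFor(LocalDate.of(2009, 2, 27), LocalDate.of(2009, 3, 2)));
        // prints [1, 2, 27, 28]
    }
}
```

In a real system each selected serial number would then be resolved to an open searcher on the corresponding index, and the searchers combined into one parallel multi-searcher; Lucene releases of that period shipped a `ParallelMultiSearcher` class for this purpose, though whether the thesis uses it unmodified is not stated in the abstract.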
Keywords/Search Tags: Full-Text Search, Inverted Index, Lucene