Research And Application Of Sorting Algorithm Based On Lucene

Posted on: 2016-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: C Ding
Full Text: PDF
GTID: 2308330482481325
Subject: Software engineering
Abstract/Summary:
Since the beginning of the twenty-first century, the Internet has developed rapidly, its information resources have become increasingly abundant, and the volume of information has grown exponentially. People rely on the Internet more and more closely, and information queries are increasingly frequent. To find the required information accurately and quickly in such vast information resources, search engines are particularly important. A search engine is an application software system for searching Internet information: it collects information from the Internet according to a certain acquisition strategy, processes it, and then provides an information service in response to user queries. To promote the development of search engine technology, the Apache Software Foundation launched Lucene, an open-source full-text search engine toolkit. How to develop search engines based on Lucene has become a hot issue in the basic search engine field.

In this thesis, two research schemes were adopted. First, a data sorting algorithm was studied based on Lucene's full-text search function. The inverted index offers fast queries and small storage space and supports ranked queries, but it does not support fast phrase queries and is not well suited to languages such as Chinese, in which word boundaries are undetermined. The suffix tree and suffix array index models support phrase queries and self-indexing and adapt well to text with uncertain word boundaries, but they do not support ranked (sorting) queries. After comparing their advantages and disadvantages, we conclude that the inverted index is the algorithm suited to Lucene full-text retrieval. Second, an improved data sorting algorithm, the SA-PL index model, was proposed. The model uses the suffix array to support phrase queries, self-indexing, and adaptability to languages with uncertain word boundaries; because a suffix array requires less storage space than a suffix tree, the suffix array is combined with inverted lists. Following the concept of the SA-PL index model, we designed the SA-PL-0 index model. Based on SA-PL-0, we then proposed the SA-PL-1 index model, which further compresses the index space by removing short inverted lists. The model improves query speed and reduces storage space, and thereby makes data sorting in a Lucene environment highly efficient. Finally, we selected an appropriate platform and environment for testing. The experiments on the improved algorithm show that the SA-PL-0 and SA-PL-1 index models provide ranked queries, phrase queries, and self-indexing, adapt well to languages with uncertain word boundaries, and achieve significantly better combined index storage space and query time than the previous index models.
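
To make the general idea of pairing a suffix array with per-document hit lists concrete, the following is a minimal Python sketch. It is not the SA-PL-0 or SA-PL-1 model from the thesis, whose construction the abstract does not specify, and it does not use Lucene. All function names (`build_suffix_array`, `find_phrase`, `build_index`, `ranked_phrase_query`), the NUL separator, and the ranking by raw occurrence count are assumptions chosen for illustration. The suffix array works on characters, so no word segmentation is needed, and the hit positions are grouped by document to produce a ranked result.

```python
from bisect import bisect_right
from collections import Counter

SEP = "\x00"  # assumed document separator; must not occur in documents or queries


def build_suffix_array(text):
    """Start positions of all suffixes in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, sa, phrase, upper):
    """Binary search for the first suffix whose prefix is >= phrase (or > phrase if upper)."""
    lo, hi, m = 0, len(sa), len(phrase)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < phrase or (upper and prefix == phrase):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_phrase(text, sa, phrase):
    """Sorted start positions of every occurrence of phrase in text."""
    if not phrase:
        return []
    start = _bound(text, sa, phrase, upper=False)
    end = _bound(text, sa, phrase, upper=True)
    return sorted(sa[start:end])


def build_index(docs):
    """Concatenate documents and record where each one starts in the joined text."""
    text = SEP.join(docs)
    starts, pos = [], 0
    for doc in docs:
        starts.append(pos)
        pos += len(doc) + len(SEP)
    return text, build_suffix_array(text), starts


def ranked_phrase_query(index, phrase):
    """Return (doc_id, occurrence_count) pairs, most occurrences first."""
    text, sa, starts = index
    hits = find_phrase(text, sa, phrase)
    # Map each hit position back to the document that contains it.
    counts = Counter(bisect_right(starts, p) - 1 for p in hits)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, with `index = build_index(["full text search", "search engine search", "inverted index"])`, calling `ranked_phrase_query(index, "search")` ranks document 1 ahead of document 0 because the phrase occurs twice in it. The naive suffix sort is only practical for small inputs; a production index would use a dedicated suffix array construction algorithm and a real relevance score.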
Keywords/Search Tags: Lucene, inverted index, suffix array, SA-PL index model