
The Research And Application Of Full-Text Search System Based On Lucene

Posted on: 2011-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: J C Su
Full Text: PDF
GTID: 2178360305960292
Subject: Computer Science and Technology
Abstract/Summary:
With the spread of Internet technology and the advance of informatization, information resources on the network are growing rapidly and take a variety of forms. Information retrieval is the technology of finding the information users need within massive information resources. How to obtain the most-needed information from massive, unstructured data has therefore become a major challenge for modern information retrieval.

As an important branch of modern information retrieval, full-text retrieval is both an important tool for handling unstructured data and one of the main technologies behind search engines. To improve the accuracy and efficiency of retrieval, it is necessary to study the relevant technologies, such as index models, word segmentation, and result-sorting algorithms.

This paper first studies the full-text index model, then improves the algorithms for Chinese segmentation and page ranking. Finally, a full-text search system is built on Lucene, an open-source framework for full-text search, to verify the improved retrieval performance. The main work of this paper is as follows:

(1) Research on the inter-relevant successive tree index model
This paper discusses and compares several popular existing full-text index models, and introduces in particular the inter-relevant successive tree ("IRST" for short) index model. Its features include fast index creation, high efficiency, and the ability to restore the original text from the index. The paper studies the IRST index model further and proposes a sorted successive index model based on the inter-relevant successive tree, which can retrieve the expected results quickly by intersecting sorted subtrees.

(2) Research and improvement of Chinese segmentation technology
To improve segmentation speed, this paper adopts the inter-relevant successive tree as the data structure of the segmentation glossary. The paper also analyzes ambiguous words encountered during segmentation and applies the "three-stage and first-word spacing method" to improve segmentation accuracy. The experimental results show that the resulting segmenter achieves high accuracy and time efficiency.

(3) Research and improvement of page ranking
This paper analyzes the main current page-ranking algorithms and improves the widely used Page Ranking algorithm. The experimental results show that the improved algorithm sorts results with higher accuracy.

(4) Design and implementation of a news retrieval system based on Lucene
Using the technologies above, the author designed and implemented a news retrieval system based on Lucene. The experimental results show that the improved full-text search system supports Chinese retrieval better and returns the most-needed information to users more accurately.
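
Item (1) mentions retrieving results by intersecting sorted subtrees. The abstract does not describe the IRST structure itself, so the following is only a minimal sketch of the general idea it relies on: when each term's entries are kept in ascending order, a multi-term query can be answered by a linear merge-style intersection. The function names, the flat-list representation, and the sample IDs are all assumptions made for illustration and are not taken from the thesis or from Lucene.

```python
from typing import List


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Return the IDs present in both lists.

    Assumes both inputs are sorted ascending and contain no duplicates
    (a hypothetical stand-in for the sorted subtrees in the abstract).
    """
    i, j = 0, 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return out


def intersect_all(lists: List[List[int]]) -> List[int]:
    """Intersect several sorted lists, starting from the shortest one."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = list(ordered[0])
    for other in ordered[1:]:
        result = intersect_sorted(result, other)
        if not result:
            break  # nothing can match any more, so stop early
    return result


if __name__ == "__main__":
    # Made-up document IDs for three query terms.
    postings = {
        "lucene": [1, 3, 5, 8, 13],
        "index": [2, 3, 5, 7, 8],
        "search": [3, 4, 8, 9],
    }
    print(intersect_all(list(postings.values())))  # prints [3, 8]
```

Each pairwise intersection visits every entry at most once, so its cost is linear in the combined length of the two lists. Starting from the shortest list keeps the intermediate result small.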
Keywords/Search Tags:Lucene, Full-Text Index Model, Inter Relevant Successive Tree (IRST), Chinese Segmentation, Page Ranking Algorithm