Research And Application Of Full-text Retrieval Technology Based On Lucene

Posted on:2013-01-22

Degree:Master

Type:Thesis

Country:China

Candidate:J P Ye

Full Text:PDF

GTID:2218330371964695

Subject:Computer software and theory

Abstract/Summary:

With the rapid growth of networked information resources, the Internet has become a vast information space. We enjoy its convenience but are, at the same time, submerged in an ocean of information. In response, information retrieval technology and web search engines have emerged and become an important area of Internet application and research.

Lucene is a full-text retrieval framework on which developers can conveniently build their own applications. Although Lucene is powerful and flexibly configurable, it is only a toolkit: it lacks an information collection module and therefore cannot provide an integrated search function on its own. In addition, Lucene's Chinese analyzers cannot segment Chinese vocabulary effectively.

This paper first analyzes the overall framework of Lucene to clarify the processes and principles of creating index files, searching them, and sorting the results. It then introduces web page collection technology and the web crawler Heritrix, analyzing its framework and the operating principles of its core components. We propose three improvements to Heritrix. First, because downloaded pages are complex and redundant, we filter URLs so that unwanted pages are skipped, which reduces the memory space required. Second, because the capture rate is low, we remove the restrictions of the robots protocol by altering part of the source code. Third, because the host-name queue assignment policy leads to overlong queues and blocked threads, we establish a new policy that assigns URLs to queues using the ELF hash algorithm, which improves capture speed. Experiments show that all three methods are effective.

The paper then introduces four Chinese word segmentation algorithms and three classic dictionary file structures, summarizes the advantages and disadvantages of each, and designs and implements a new Chinese analyzer.
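The ELF-hash queue assignment mentioned above for Heritrix can be illustrated with a minimal sketch. The abstract does not give the actual implementation, so the class `ElfQueueAssigner`, its `queueCount` parameter, and the choice to hash the full URL string are assumptions made for illustration only; the sketch does not use or reproduce any Heritrix API.

```java
// Hypothetical helper, not part of Heritrix: spreads URLs over a fixed
// number of queues by ELF hash instead of grouping them by host name.
final class ElfQueueAssigner {
    private final int queueCount;

    ElfQueueAssigner(int queueCount) {
        if (queueCount <= 0) {
            throw new IllegalArgumentException("queueCount must be positive");
        }
        this.queueCount = queueCount;
    }

    // Classic ELF string hash; the result always fits in 28 bits,
    // so it is never negative.
    static int elfHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 4) + s.charAt(i);
            int high = h & 0xF0000000;   // top four bits
            if (high != 0) {
                h ^= high >>> 24;        // fold them back into the low bits
            }
            h &= ~high;                  // clear the top four bits
        }
        return h;
    }

    // Assumption: the queue index is the hash modulo the queue count.
    int queueFor(String url) {
        return elfHash(url) % queueCount;
    }
}
```

Because the queue index depends on the whole URL rather than on the host name alone, pages from one large host are spread across several queues, which is one plausible way to avoid the overlong queues and blocked threads described above.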
The new analyzer uses a three-level index structure that combines the advantages of table and tree structures, reducing memory use and speeding up word lookup. It adopts an improved forward maximum matching algorithm, whose main idea is as follows. Traverse the sentence from left to right and compute the hash value of the first character, then look that value up in the first-level index. If the lookup succeeds, append the next character to the prefix string, compute the length of the new string, and look that length up in the second-level index. If that succeeds, compute the hash value of the new string and look it up in the third-level index; if it is found, record the matched length. Continue appending characters in this way until the length of the longest entry under the current first character is reached. This resembles the character-by-character matching of a TRIE index tree and eliminates the blind spots of the traditional algorithm, while also avoiding repeated binary searches and thereby improving efficiency. Experiments show that the new Chinese analyzer performs well.

Finally, the paper combines all of this research and analysis to implement a full-text information retrieval model system based on J2EE that fulfills users' retrieval tasks.
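The matching procedure described above can be approximated with a short sketch. It is not the author's dictionary format: the three index levels are modeled here with standard Java collections (first character, then word length, then a hash set of words), and the class name `ForwardMaxMatcher` is hypothetical. Characters are treated as UTF-16 `char` values, so characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of forward maximum matching over a three-level index.
final class ForwardMaxMatcher {
    // Level 1: first character; level 2: word length; level 3: hashed words.
    private final Map<Character, Map<Integer, Set<String>>> index = new HashMap<>();
    // Longest entry per first character; bounds how far each scan extends.
    private final Map<Character, Integer> maxLen = new HashMap<>();

    ForwardMaxMatcher(Iterable<String> dictionary) {
        for (String word : dictionary) {
            if (word.isEmpty()) {
                continue;
            }
            char first = word.charAt(0);
            index.computeIfAbsent(first, k -> new HashMap<>())
                 .computeIfAbsent(word.length(), k -> new HashSet<>())
                 .add(word);
            maxLen.merge(first, word.length(), Math::max);
        }
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            char first = text.charAt(pos);
            Map<Integer, Set<String>> byLength = index.get(first);
            int best = 1; // unknown characters become single-character tokens
            if (byLength != null) {
                int limit = Math.min(maxLen.get(first), text.length() - pos);
                // Grow the candidate one character at a time and remember
                // the longest length that is an actual dictionary entry.
                for (int len = 1; len <= limit; len++) {
                    Set<String> words = byLength.get(len);
                    if (words != null && words.contains(text.substring(pos, pos + len))) {
                        best = len;
                    }
                }
            }
            tokens.add(text.substring(pos, pos + best));
            pos += best;
        }
        return tokens;
    }
}
```

For example, with the dictionary 中国, 中国人, 人民, 银行, calling `segment("中国人民银行")` yields 中国人, 民, 银行: the scan keeps extending past 中国 until it reaches the longest entry starting with 中, which is the greedy left-to-right behavior of forward maximum matching.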

Keywords/Search Tags:

Search engine, full-text retrieval, Lucene, Heritrix, Chinese word segmentation

Related items

1	The Design And Realization Of Full Text Retrieval Based On Lucene Mobile Phone
2	Research And Implementation Of A Chinese Full-Text Information Retrieval Technology Based-on Lucene Search Engine
3	The Research And Implementation Of Full-Text Search Engine Based On Lucene
4	Research And Design Of Search Within Application System Based On Lucene
5	Design And Improvement Of Website Full-text Retrieval System Based On Lucene
6	Research And Implementation Of The Vertical Search Engine System Based On JAVA With LUCENE And HERITRIX
7	Application Study Of Lucene Full-text Retrieval On The Network Education Platform
8	Development And Maintenance Of Full-text Retrieval Web System Based On Lucene
9	A Research On Chinese Word Segmention Based On The Combination Of Dictionary And Statistics And Full-Text Retrieval System Design
10	Full-text Search For The Modern Chinese Text Processing, Automatic Word Generic System