Research And Application Of Full-text Retrieval Technology Based On Lucene

Posted on: 2013-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: J P Ye
Full Text: PDF
GTID: 2218330371964695
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of networked information resources, the Internet has become a vast information space. We enjoy its convenience but are at the same time submerged in an ocean of information. In response, information retrieval technology and web search engines have emerged and become an important area of Internet application and research.

Lucene is a full-text retrieval framework on which developers can readily build their own applications. Although Lucene is powerful and flexibly configurable, it is only a toolkit: it lacks an information collection module and so cannot provide a complete search function by itself. In addition, Lucene's Chinese analyzers cannot segment Chinese vocabulary effectively.

This thesis first analyzes the overall architecture of Lucene, covering the process and principles of creating index files, searching them, and sorting the results. It then introduces web page collection technology and the web crawler Heritrix, analyzing its architecture and the operating principles of its core components. Three improvements to Heritrix are proposed. First, because downloaded pages are numerous and redundant, URLs are filtered so that unwanted pages are ignored, which reduces the space required. Second, because the crawl rate is low, the restrictions of the robots protocol are removed by modifying part of the source code. Third, because the host-name queue assignment policy produces overlong queues and leaves some threads blocked, a new policy assigns URLs to queues using the ELF hash algorithm, which improves crawl speed. Experiments show that all three methods are effective.

The thesis introduces four Chinese word segmentation algorithms and three classic dictionary file structures, summarizes the advantages and disadvantages of each, and then designs and implements a new Chinese analyzer.
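As a rough illustration of the queue assignment change described above, the sketch below spreads URLs over a fixed number of queues with the classic ELF hash. The abstract does not say which string is hashed or how the hash is mapped to a queue, so hashing the full URL and taking the result modulo the queue count are assumptions, and the class and method names (`ElfQueueAssigner`, `elfHash`, `queueFor`) are hypothetical and not part of Heritrix.

```java
import java.nio.charset.StandardCharsets;

class ElfQueueAssigner {
    // Classic ELF hash over the UTF-8 bytes of the input, kept within 32 bits.
    static long elfHash(String s) {
        long h = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h = ((h << 4) + (b & 0xFF)) & 0xFFFFFFFFL;
            long g = h & 0xF0000000L;   // top four bits of the 32-bit value
            if (g != 0) {
                h ^= g >> 24;           // fold the top bits back into the low bits
            }
            h &= ~g;                    // clear the top four bits
        }
        return h;
    }

    // Assumption: the whole URL is hashed and mapped to a queue by modulo.
    static int queueFor(String url, int queueCount) {
        return (int) (elfHash(url) % queueCount);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a",
            "http://example.com/b",
            "http://example.org/index.html"
        };
        for (String url : urls) {
            System.out.println(queueFor(url, 8) + "  " + url);
        }
    }
}
```

Unlike a host-name policy, URLs from one large host can land in different queues here, which is the effect the abstract credits for shorter queues and fewer blocked threads. A real crawler would still need to decide how to keep per-host politeness under such a policy.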
The new analyzer uses a three-level index structure that combines the advantages of table and tree structures, reducing memory use and speeding up word lookup. It adopts an improved forward maximum matching algorithm whose main idea is as follows: traverse the sentence from left to right and compute the hash value of the first character, then look that value up in the first-level index. If the lookup succeeds, append the next character to the prefix string, compute the length of the new string, and look that length up in the second-level index. If that succeeds, compute the hash value of the new string and look it up in the third-level index; if that also succeeds, record the length. Continue appending characters in this way until the length of the longest entry under the current first character is reached. This resembles the character-by-character matching of a TRIE index tree and eliminates the blind spots of the traditional method, while also avoiding repeated binary searches and so improving efficiency. Experiments show that the new Chinese analyzer performs well. Finally, the thesis combines the above research and analysis to implement a full-text information retrieval model system based on J2EE that carries out users' retrieval tasks.
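The matching procedure described above can be sketched roughly as follows. This is a simplified illustration, not the thesis's implementation: Java's `HashMap` and `HashSet` stand in for the three hash-based index levels (first character, then word length, then the word itself), and the class name `ThreeLevelFmm`, its methods, and the sample dictionary are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ThreeLevelFmm {
    // Level 1: first character -> Level 2: word length -> Level 3: words of that length.
    private final Map<Character, Map<Integer, Set<String>>> index = new HashMap<>();
    // Longest dictionary entry starting with each first character.
    private final Map<Character, Integer> maxLen = new HashMap<>();

    void addWord(String word) {
        if (word.isEmpty()) {
            return;
        }
        char first = word.charAt(0);
        index.computeIfAbsent(first, k -> new HashMap<>())
             .computeIfAbsent(word.length(), k -> new HashSet<>())
             .add(word);
        maxLen.merge(first, word.length(), Math::max);
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char first = text.charAt(i);
            Map<Integer, Set<String>> byLength = index.get(first);
            int best = 1; // fall back to a single character when nothing matches
            if (byLength != null) {
                int limit = Math.min(maxLen.get(first), text.length() - i);
                // Grow the candidate one character at a time and remember the longest hit.
                for (int len = 1; len <= limit; len++) {
                    Set<String> words = byLength.get(len);
                    if (words != null && words.contains(text.substring(i, i + len))) {
                        best = len;
                    }
                }
            }
            tokens.add(text.substring(i, i + best));
            i += best;
        }
        return tokens;
    }

    public static void main(String[] args) {
        ThreeLevelFmm segmenter = new ThreeLevelFmm();
        for (String w : new String[] {"中国", "中国人", "人民", "全文", "检索"}) {
            segmenter.addWord(w);
        }
        System.out.println(segmenter.segment("中国人民全文检索"));
    }
}
```

With this sample dictionary the sketch prefers the longer entry 中国人 over 中国, leaving 民 as a single-character token, which is the usual greedy behavior of forward maximum matching. Each candidate is checked by hash lookup, so no binary search over a sorted word list is needed.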
Keywords/Search Tags:Search engine, full-text retrieval, Lucene, Heritrix, Chinese word segmentation