The Research And Application Of Search Engine Based On Lucene

Posted on: 2012-08-22
Degree: Master
Type: Thesis
Country: China
Candidate: J X Wei
GTID: 2178330332490758
Subject: Computer software and theory
Abstract/Summary:
Search engines are essential tools for Internet users and a very practical technology. Although the major search engines continue to improve and develop their technology, they still do not fully satisfy users' needs. As the number of Internet users and the volume of information keep growing, higher demands will be placed on search engines, and how to extract valuable information from massive data in a timely and accurate manner has become a major subject of search engine research.

In this paper, the indexing and retrieval process of the search engine is built on Lucene's indexing package. The PageRank algorithm is applied to improve the ranking performance of the search engine. To address deficiencies in the web crawler of the Nutch system, a crawling strategy, a technique for filtering duplicate pages, and a method for updating information are introduced. The word segmentation algorithm is designed on the basis of the maximum matching algorithm and a probability-based algorithm. Web text clustering with the K-means algorithm is applied to improve the relevance of search results. These techniques are combined to build a complete search engine system.

This work makes the following contributions:

First, full-text indexing and retrieval are implemented on top of Lucene, and a word segmentation algorithm is designed that combines maximum matching with a probability-based method.

Second, the Internet spider follows the basic model of the Nutch web crawler, uses the PageRank algorithm as its crawling strategy, removes duplicate pages on the basis of both URL and content, and uses the Quartz job scheduling system to invoke the crawler at regular intervals, so that the local page collection is updated and the information stays current.

Third, improved versions of the PageRank algorithm and the Lucene sorting algorithm are proposed to address their respective defects, and the two are used together to produce a more reasonable ranking. In addition, web document clustering based on the K-means algorithm is applied to improve the relevance of search results.
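
The abstract names maximum matching as one basis of the word segmentation algorithm but does not give its details. As an illustration only, the sketch below shows plain forward maximum matching in Python: at each position it takes the longest dictionary word it can find and falls back to a single character otherwise. The function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are assumptions made for this example; the probability-based part of the thesis's algorithm is not reproduced here.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (illustrative sketch, not the thesis's code)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, used only for this example.
TOY_DICTIONARY = {"搜索", "引擎", "搜索引擎", "技术"}

print(forward_max_match("搜索引擎技术", TOY_DICTIONARY))
# → ['搜索引擎', '技术']
```

Because the longest match is tried first, "搜索引擎" is kept as one token rather than being split into "搜索" and "引擎". A purely greedy scan like this can pick the wrong split for ambiguous strings, which is presumably the kind of case a probability-based step is meant to resolve.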
Keywords/Search Tags: search engine, Lucene, Web Crawler, word segmentation algorithm