Research And Implementation Of Topic-Oriented Search Engine Based On Lucene

Posted on: 2009-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Zhang
GTID: 2178360272474003
Subject: Computer software and theory
Abstract/Summary:
With the rapid advance of informatization, the volume of information is growing at an extraordinary pace. The World Wide Web has become a major source of this information, and information disorientation and information overload have become increasingly serious problems. Many kinds of Internet search engines have emerged and developed rapidly, but general-purpose search engines often fail to meet users' needs when precise information is required. A topic-oriented search engine is a search engine that can return accurate results over structured data. As information grows explosively and spreads across ever more fields, topic-oriented search engines have become an active area of research and development. This thesis studies the main technologies involved and builds a topic-oriented search engine for mobile phone information using Lucene.
The thesis first analyzes the basic principles and strategies of information collection and examines several classic search engine algorithms in depth. To improve the quality of the topic-oriented information collected, it improves the Shark algorithm, which alleviates the topic drift problem to some extent. It then analyzes the full-text search package Lucene: how Lucene uses the vector space model (VSM), the structure of its index files, its inverted index technique and ways to improve indexing performance, and its document scoring algorithm, with a worked example exploring the factors that affect document scores. Parts of the core code of Lucene and Heritrix are also analyzed.
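As a rough illustration of the inverted index and the scoring factors mentioned above, the following Python sketch builds a toy index and ranks documents with a TF-IDF score. It is loosely modeled on the factors in Lucene's classic similarity (square-root term frequency, squared inverse document frequency, length normalization) and omits others such as boosts and query normalization. The `TinyIndex` class, the `tokenize` helper, and the sample documents are assumptions made for this example; they are neither Lucene's API nor the thesis's code.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


class TinyIndex:
    """Toy inverted index with TF-IDF scoring; not Lucene's implementation."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        """Return (doc_id, score) pairs, best score first."""
        n_docs = len(self.doc_len)
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rare terms weigh more than common ones.
            idf = 1.0 + math.log(n_docs / (len(docs) + 1.0))
            for doc_id, tf in docs.items():
                # Shorter documents get a higher per-match weight.
                length_norm = 1.0 / math.sqrt(self.doc_len[doc_id])
                scores[doc_id] += math.sqrt(tf) * idf * idf * length_norm
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample documents about phones.
index = TinyIndex()
index.add("d1", "nokia phone with camera")
index.add("d2", "phone phone phone review and price list for many models")
index.add("d3", "laptop review")
results = index.search("nokia phone")
```

In this sketch the short document `d1` matches both query terms and ranks first, while the longer `d2` matches only the common term "phone" and is penalized by length normalization. This is the kind of interaction between factors that a worked scoring example would expose.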
To meet practical requirements, the thesis extends Heritrix's FrontierScheduler with a URL selection strategy so that the crawler stays precisely on topic; uses regular expressions and the HtmlParser package to design extraction patterns that obtain accurate mobile phone information; designs and implements a class, based on the basic principles of redundant page elimination, that largely resolves the redundant page problem; and uses the JE package to extend Lucene's word segmentation module, compensating for Lucene's weakness in segmenting Chinese phrases.
Building on this study of the main technologies, a topic-oriented search engine for mobile phone information was designed and built. The project uses the Heritrix crawler to collect topic-relevant web pages; Heritrix is flexible and can be extended to meet specific needs. Lucene is then used to index and retrieve the collected pages. Finally, a user interface was designed and implemented with the Spring Framework and the DWR framework, through which users search the index to obtain the results they want. The interface is user-friendly, the system achieves relatively good recall and precision, and the test results largely meet the intended goals.
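The URL selection and redundant page elimination ideas can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Heritrix's API and not the thesis's improved Shark algorithm: the topic vocabulary, the keyword-overlap `relevance` function, the threshold, and the `TopicFrontier` and `DuplicateFilter` classes are all hypothetical names chosen for illustration. Duplicate detection here catches only exact duplicates after whitespace and case normalization.

```python
import hashlib
import heapq

TOPIC_TERMS = {"phone", "mobile", "handset"}  # hypothetical topic vocabulary


def relevance(url, anchor_text):
    """Fraction of topic terms that occur in the URL or its anchor text."""
    haystack = (url + " " + anchor_text).lower()
    hits = sum(1 for term in TOPIC_TERMS if term in haystack)
    return hits / len(TOPIC_TERMS)


class TopicFrontier:
    """Priority queue of URLs; more relevant URLs are crawled first."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.heap = []        # entries are (-score, insertion order, url)
        self.seen_urls = set()
        self.counter = 0

    def schedule(self, url, anchor_text):
        """Queue the URL if it is new and relevant enough; True if queued."""
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        score = relevance(url, anchor_text)
        if score < self.threshold:
            return False
        heapq.heappush(self.heap, (-score, self.counter, url))
        self.counter += 1
        return True

    def next_url(self):
        """Pop the most relevant queued URL, or None if the queue is empty."""
        return heapq.heappop(self.heap)[2] if self.heap else None


class DuplicateFilter:
    """Flags pages whose normalized text has already been seen."""

    def __init__(self):
        self.fingerprints = set()

    def is_duplicate(self, page_text):
        normalized = " ".join(page_text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in self.fingerprints:
            return True
        self.fingerprints.add(digest)
        return False


# Hypothetical links discovered on a crawled page.
frontier = TopicFrontier()
queued = [
    frontier.schedule("http://example.com/news", "sports news"),
    frontier.schedule("http://example.com/phones", "catalog"),
    frontier.schedule("http://example.com/mobile/n95", "Nokia N95 phone"),
    frontier.schedule("http://example.com/phones", "catalog"),
]
order = [frontier.next_url(), frontier.next_url(), frontier.next_url()]

pages = DuplicateFilter()
first_seen = pages.is_duplicate("Nokia  N95\nprice")
second_seen = pages.is_duplicate("nokia n95 price")
```

Off-topic and repeated links are dropped at scheduling time, and the link matching two topic terms is dequeued before the one matching a single term. A real focused crawler would typically also weigh the relevance of the parent page and detect near-duplicates rather than only exact ones.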
Keywords/Search Tags: Lucene, Topic-Oriented Search Engine, Crawl Algorithm, Retrieval, Index