
Design And Implementation Of A Job Vertical Search Engine Based On Lucene And Heritrix

Posted on: 2011-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: L Li
GTID: 2178360302992811
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth and diversification of information, traditional search engines must collect, index, and query an ever-expanding body of content, and users have to pick out the information they need from a large volume of irrelevant results. A vertical search engine targets a particular domain and so avoids retrieving much of that irrelevant information. Compared with a traditional search engine, it is more focused, more accurate, and faster, allowing users to find the information they need more quickly.

As more and more people rely on the Internet and its huge volume of information, searching for jobs online has become common, and job seekers have to keep track of the major job sites and other job-related sites. To make online job hunting easier, this thesis designs and builds a job vertical search engine based on Lucene and Heritrix, and discusses the principles, techniques, and basic implementation process of a vertical search engine for jobs. We use Heritrix to crawl job data from several job sites, structure the crawled data, and then index and store it with Lucene to build the job vertical search engine.

The system uses MDA (model-driven architecture) to guide the analysis and design phases of development, and uses the open-source toolkits Heritrix and Lucene for the implementation. The whole system is divided into four parts: the information extraction module, the crawler module, the index and storage module, and the user search module. In the crawler module, we design a custom crawler, based on our understanding of Heritrix, that meets the system requirements. In the information extraction module, we use HTMLParser to parse web pages and apply the concept of locating key nodes to extract the needed information from the page structure. In the index and storage module, we use Lucene to index the data and store the data in a database, which improves system performance. The search module is designed with a three-tier architecture and presents the information the user needs in a structured form.
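
The extract-then-index flow described above can be illustrated with a minimal sketch. It is not the thesis implementation: it uses only the Python standard library as a stand-in for HTMLParser and Lucene, and it assumes that a "key node" is a label cell such as "Title:" whose value is the next text node. The labels, the sample page, and the names KeyNodeExtractor, extract_job, and TinyIndex are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical labels acting as "key nodes"; real job sites will differ.
KEY_LABELS = {"Title:": "title", "Company:": "company", "City:": "city"}

# Hypothetical job page used only for illustration.
SAMPLE_PAGE = """
<table>
  <tr><td>Title:</td><td>Java Search Engineer</td></tr>
  <tr><td>Company:</td><td>Example Corp</td></tr>
  <tr><td>City:</td><td>Beijing</td></tr>
</table>
"""


class KeyNodeExtractor(HTMLParser):
    """Stores the text node that follows each known label node."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._pending = None  # field name whose value is expected next

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if text in KEY_LABELS:
            self._pending = KEY_LABELS[text]
        elif self._pending is not None:
            self.fields[self._pending] = text
            self._pending = None


def extract_job(html):
    """Turn one crawled page into a structured record (a dict of fields)."""
    parser = KeyNodeExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, job):
        doc_id = len(self.docs)
        self.docs.append(job)
        for value in job.values():
            for term in value.lower().split():
                self.postings[term].add(doc_id)
        return doc_id

    def search(self, query):
        """Return records containing every query term (AND semantics)."""
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.docs[i] for i in sorted(ids)]


index = TinyIndex()
index.add(extract_job(SAMPLE_PAGE))
results = index.search("java engineer")
```

A real index such as Lucene adds analysis, ranking, and persistent storage on top of this idea; the sketch only shows how structured records produced by the extraction step feed the index that the search module queries.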
Keywords/Search Tags:vertical search, Heritrix, Lucene, HTMLParser, MDA