Design And Implementation Of Vertical News Search Engine Based On Heritrix

Posted on: 2018-06-21  Degree: Master  Type: Thesis
Country: China  Candidate: R D Yang  Full Text: PDF
GTID: 2428330566989556  Subject: Engineering
Abstract/Summary:
With the arrival of the information age, information on the Internet has become increasingly complex. General-purpose search engines struggle to return sufficiently accurate results when users search within specialized fields. Under these circumstances, a vertical search engine is a better choice because of its high precision and domain focus. This thesis designs and implements a vertical search engine based on Heritrix and Lucene. The search engine is divided into four modules: a crawler module, a preprocessing module, an index module, and a retrieval module. Based on an analysis of the Heritrix source code, the thesis extends the Heritrix Frontier module to implement the crawling logic. It also uses ELFHash to optimize the allocation of download threads, which improves the speed at which Heritrix downloads web pages. Regular expressions are used to filter the web pages. Jsoup is then used to extract structured information from each page based on its document object model (DOM) tree, and the structured information is stored in a database. Lucene is used to create the index files and to retrieve results for user queries from them. To improve the effectiveness of Chinese search, the IK Chinese word segmenter replaces Lucene's default Chinese word segmentation tool. Finally, a web service platform is built on Tomcat so that users can search the news.
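The abstract does not say how ELFHash is wired into Heritrix, so the following is only a minimal sketch of the general idea: hash a URL string with the classic ELF hash and take the result modulo the number of download queues, so that URLs spread across queues instead of piling up behind a single host. The class name `ElfHashDemo`, the method `queueFor`, the queue count, and the example URLs are all hypothetical and do not come from the thesis or from the Heritrix API.

```java
// Illustrative sketch only: not the thesis's code and not a Heritrix class.
class ElfHashDemo {

    // Classic ELF hash. The top four bits are folded back into the value
    // and then cleared, so the result always fits in 28 bits (non-negative).
    static int elfHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 4) + s.charAt(i);
            int g = h & 0xF0000000;
            if (g != 0) {
                h ^= g >>> 24;
            }
            h &= ~g;
        }
        return h;
    }

    // Hypothetical helper: map a URL to one of numQueues download queues.
    static int queueFor(String url, int numQueues) {
        if (numQueues <= 0) {
            throw new IllegalArgumentException("numQueues must be positive");
        }
        return elfHash(url) % numQueues;
    }

    public static void main(String[] args) {
        // Assumed example URLs and queue count, chosen only for the demo.
        String[] urls = {
            "http://news.example.com/a/1.html",
            "http://news.example.com/a/2.html",
            "http://news.example.com/b/3.html"
        };
        int numQueues = 8;
        for (String url : urls) {
            System.out.println(url + " -> queue " + queueFor(url, numQueues));
        }
    }
}
```

Because the hash is deterministic, the same URL always maps to the same queue, while different URLs from one site can land in different queues and be fetched by different threads. A real crawler would still need to respect per-host politeness limits, which this sketch ignores.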
Keywords/Search Tags: Vertical Search Engine, Heritrix, Lucene, Web Crawler, Structured Information Extraction