Research And Implementation Of Domain Based Web Crawler

Posted on:2018-04-08

Degree:Master

Type:Thesis

Country:China

Candidate:J F Liu

Full Text:PDF

GTID:2348330512489091

Subject:Communication and Information System

Abstract/Summary:

With the development of the Internet, search engines have become a necessary tool in people's daily lives, work and study. General-purpose search engines provide powerful information retrieval. However, as the volume of information grows, the number of users increases and the demand for more refined results rises, the shortcomings of traditional general search engines are becoming more and more prominent. For example, search results are not deep enough and do not match specific needs. Domain-based vertical search engines have emerged in response. A domain-based web crawler targets the information of a particular industry and is characterized by specialization and precision. This thesis first discusses the types and current state of existing search engines, then explains the shortcomings of traditional search engines and the main crawling strategies and algorithms, and sets out the research focus of the domain-based crawler. Second, the thesis explains in detail the architecture of currently popular open-source web crawlers and, on that basis, selects Heritrix+Lucene to build the crawler platform for the mobile domain. It then analyzes the design structure of Heritrix, extends the Heritrix crawler, and improves the source code to eliminate some of Heritrix's design flaws. In view of the shortcomings of traditional search engines, the thesis proposes a new crawling strategy for the domain-based crawler. It introduces the concept of semantic influence when using the vector space model (VSM) to calculate text similarity, and proposes a domain topic crawling strategy based on the Shark-Search algorithm. The topic crawling strategy constructs an ontology model for the mobile domain and a semantic matrix to compute page similarity. According to where text appears (title, meta, anchor text, context and so on), the thesis assigns different weights to the text, which refines the calculation of domain topic similarity. The thesis also improves the PageRank algorithm to take into account the effect of the parent page's importance on the child page. Finally, it sorts the URL queue by combining this importance measure with the similarity calculation described above, in order to avoid the topic drift problem and the tunnel phenomenon of traditional search. The crawler extension is implemented in Java, and the thesis concludes by analyzing the crawling results and discussing points that can be improved in future work.
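
The position-weighted similarity described in the abstract can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class name `WeightedVsm`, the field names and the weight values are all assumptions made for illustration, and the ontology model and semantic matrix are left out. It only shows how terms from the title, meta, anchor and body text might contribute with different weights to a VSM vector that is then compared with a topic vector by cosine similarity.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: position-weighted term vectors compared by cosine similarity (VSM). */
final class WeightedVsm {

    /** Hypothetical field weights; the abstract does not give the actual values. */
    static final Map<String, Double> FIELD_WEIGHTS = Map.of(
            "title", 3.0,
            "meta", 2.0,
            "anchor", 2.0,
            "body", 1.0);

    /**
     * Builds a term -> weight vector for one page.
     * Each occurrence of a term adds the weight of the field it appears in;
     * unknown field names fall back to a weight of 1.0.
     */
    static Map<String, Double> pageVector(Map<String, String> fields) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            double weight = FIELD_WEIGHTS.getOrDefault(field.getKey(), 1.0);
            for (String term : field.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    vector.merge(term, weight, Double::sum);
                }
            }
        }
        return vector;
    }

    /** Cosine similarity of two sparse vectors; returns 0.0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Under these assumed weights, a page whose title contains the topic terms scores higher against the topic vector than a page that mentions the same terms only in its body. A crawler could combine such a score with a link-based importance value when ordering its URL queue.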

Keywords/Search Tags:

Heritrix, ontology, domain crawler, VSM, PageRank

Related items

1	Focused Crawler Based On Ant Colony Research And Implementation
2	Research Of Uighur Information Search Engine Based On Heritrix
3	A Web Crawler System For Professional-town Information Based On Heritrix Framework
4	Research And Implementation Of Topic Crawler Based On Domain Ontology
5	Research Heritrix And Vertical Search Engine Based On Lucene
6	Research And Implementation Of Semi-automatic Ontology Construction Based On WordNet And Focused Crawler
7	Research On Ontology - Based Learning Resource Construction Model And Its Application
8	Research And Design Of Video Tutorial Base On Theme Crawler
9	Design And Implementation Of The Focused Crawler System Based On Customized Domain Conceptions
10	Research On Focused Crawler Technology Based On Domain Ontology And Multi-objective Ant Colony Optimization Algorithm