
Research And Implementation Of Domain Based Web Crawler

Posted on: 2018-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: J F Liu
GTID: 2348330512489091
Subject: Communication and Information System
Abstract/Summary:
With the development of the Internet, search engines have become a necessary tool in people's lives, work and study. General-purpose search engines provide a powerful information search function. However, as the volume of information grows, the number of users rises and the demand for more refined results increases, the shortcomings of traditional general search engines are becoming more and more prominent: for example, search results are not deep enough and do not match specific needs. Domain-based vertical search engines have emerged in response. A domain-based web crawler targets the information of a particular industry and is characterized by specialization and fine granularity.

This thesis first discusses the types and current state of existing search engines, and then explains the shortcomings of traditional search engines and the main crawling strategies and algorithms. It sets out the research focus of domain-based crawlers. Second, it explains in detail the architecture of currently popular open source web crawlers and, on that basis, selects Heritrix+Lucene to build a crawler platform for the mobile domain. It then analyzes the design structure of Heritrix, extends the Heritrix crawler, and modifies the source code to eliminate some of its design flaws.

In view of the shortcomings of traditional search engines, this thesis proposes a new crawling strategy for the domain-based crawler. It introduces the concept of semantic influence when using the vector space model (VSM) to calculate text similarity, and proposes a domain topic crawling strategy based on the Shark-Search algorithm. The topic crawling strategy constructs an ontology model of the mobile domain and a semantic matrix to compute page similarity. The thesis assigns different weights to text according to its location (title, meta, anchor text, context and so on), which refines the calculation of domain topic similarity. It also improves the PageRank algorithm by taking into account the effect of the parent page's importance on the child page. Finally, it sorts the URL queue by combining this improved PageRank score with the similarity calculation described above, in order to avoid the topic drift problem and the tunneling phenomenon of traditional search. The crawler extensions are implemented in Java, the crawling results are analyzed, and points that can be improved in the future are discussed.
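
The location-weighted similarity idea described in the abstract can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class name `WeightedVsmSketch`, the field names and the weight values are all hypothetical, and the ontology model and semantic matrix are left out. The sketch only shows how terms found in the title, meta or anchor text might count more than body terms when a page vector is compared with a topic vector by cosine similarity.

```java
import java.util.HashMap;
import java.util.Map;

final class WeightedVsmSketch {
    // Hypothetical field weights; the abstract does not state the actual values.
    static final Map<String, Double> FIELD_WEIGHTS = Map.of(
        "title", 3.0,
        "meta", 2.0,
        "anchor", 2.0,
        "body", 1.0);

    /** Builds a term vector in which each occurrence counts with the weight of its field. */
    static Map<String, Double> pageVector(Map<String, String> fields) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Unknown fields fall back to the body weight of 1.0.
            double weight = FIELD_WEIGHTS.getOrDefault(field.getKey(), 1.0);
            for (String term : field.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    vector.merge(term, weight, Double::sum);
                }
            }
        }
        return vector;
    }

    /** Cosine similarity of two sparse vectors; returns 0 when either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Under these assumed weights, a page whose title mentions the topic terms scores higher against the topic vector than a page that mentions them only in its body, which is the effect the location weighting is meant to achieve.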
Keywords/Search Tags:Heritrix, ontology, domain crawler, VSM, PageRank