Research On An Algorithm Of Focused Crawler In Vertical Search Engine

Posted on:2016-05-30

Degree:Master

Type:Thesis

Country:China

Candidate:H Zhang

Full Text:PDF

GTID:2298330470450508

Subject:Computer application technology

Abstract/Summary:

With the rapid development of Internet technology, Internet resources are increasing exponentially. Exhaustive crawlers, which aim to search the entire Internet, no longer meet the precise needs of users in different fields, and the vertical search engine arose in response.

The focused crawler is the core of a vertical search engine; its crawling quality and efficiency directly determine the engine's performance. In contrast with exhaustive crawlers, focused crawlers aim to retrieve pages that are relevant to specific topics while filtering out irrelevant documents on the web. Their characteristics are specialization, accuracy, and depth. Traditional focused crawlers use the whole page's content to predict the relevance of an unvisited link. An evaluation based on the whole page may be inaccurate, because a web page usually contains multiple topics and not all of them relate to the given topic.

In this paper, we study the topic relevance algorithms and search strategies of focused crawlers and, to address this disadvantage of traditional focused crawlers, propose a focused crawler based on the topic boundary around an unvisited link. The main research work is as follows:

First, the topic boundary around the unvisited link is identified by drawing 2-D coordinates in combination with the characteristics of the Dewey Decimal Classification (DDC). DDC is a hierarchical classification method in which each specific topic corresponds to one or more classification numbers because of polysemy. DDC numbers can be used to determine whether two words are the same or belong to related topics. The topic boundary around an unvisited link is a set of keywords with the same or similar thematic meaning, drawn mainly from the anchor text and the body text.
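As a rough illustration of the idea of a topic boundary, the following Python sketch keeps the anchor words plus those body words whose classification numbers share a class prefix with an anchor word. It is not the method of the thesis: the 2-D coordinate step is omitted, and the lookup table `TOY_DDC`, its numbers, and the helper names `related` and `topic_boundary` are hypothetical placeholders that have not been checked against the DDC schedules.

```python
# Hypothetical word -> classification-number table (placeholder values only).
# A word may map to several numbers because of polysemy.
TOY_DDC = {
    "crawler": {"025.04"},
    "search": {"025.04"},
    "index": {"025.04", "310"},
    "python": {"005.13", "597.96"},
    "recipe": {"641.5"},
    "football": {"796.33"},
}


def related(word_a, word_b, table=TOY_DDC, prefix_len=3):
    """Two words are related if they are equal or share a class-number prefix."""
    if word_a == word_b:
        return True
    codes_a = table.get(word_a, set())
    codes_b = table.get(word_b, set())
    return any(a[:prefix_len] == b[:prefix_len] for a in codes_a for b in codes_b)


def topic_boundary(anchor_words, body_words, table=TOY_DDC):
    """Return the anchor words followed by body words related to any anchor word."""
    kept = [w for w in body_words
            if any(related(w, a, table) for a in anchor_words)]
    return list(anchor_words) + kept


boundary = topic_boundary(["crawler"], ["search", "recipe", "index", "football"])
print(boundary)
```

In this toy run, the body words on unrelated topics are dropped before any relevance score is computed, which is the noise-reduction effect the abstract describes.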
This focused crawler uses the anchor text, together with the body text whose meaning is similar to that of the anchor text, to calculate the relevance of an unvisited link, which avoids the impact of noise on the result.

Furthermore, a naive Bayes text classifier is built to analyze the topic boundary around an unvisited link and to guide the focused crawler as it crawls. We consider naive Bayes to be among the most effective algorithms for text categorization to date. Anchor text is more representative of the thematic meaning of an unvisited link, so anchor text keywords are given a higher weight when judging relevance.

Last, precision and analogic recall are used as performance metrics to compare the focused crawler proposed in this paper with other crawler algorithms in terms of crawling quality. We collect and analyze statistics on the outcomes; the experimental results show that the proposed focused crawler is more effective at improving crawling quality.
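The anchor-text weighting can be sketched with a small multinomial naive Bayes classifier in Python. This is a minimal sketch under stated assumptions, not the classifier built in the thesis: the class name `WeightedNaiveBayes`, the `anchor_weight` parameter, the Laplace smoothing, and the four training documents are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


class WeightedNaiveBayes:
    """Multinomial naive Bayes in which anchor words count more than body words."""

    def __init__(self, anchor_weight=2.0):
        self.anchor_weight = anchor_weight  # assumed weight, not from the thesis
        self.word_counts = {}               # label -> Counter of word frequencies
        self.doc_counts = Counter()         # label -> number of training documents
        self.vocab = set()

    def fit(self, docs, labels):
        for words, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocab.update(words)

    def log_scores(self, anchor_words, body_words):
        # Build weighted term frequencies for the topic boundary.
        features = Counter()
        for w in anchor_words:
            features[w] += self.anchor_weight
        for w in body_words:
            features[w] += 1.0
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for w, weight in features.items():
                # Laplace-smoothed word likelihood.
                p = (counts[w] + 1) / (total_words + len(self.vocab))
                score += weight * math.log(p)
            scores[label] = score
        return scores

    def predict(self, anchor_words, body_words):
        scores = self.log_scores(anchor_words, body_words)
        return max(scores, key=scores.get)


# Tiny made-up training set.
docs = [["crawler", "search", "index"], ["search", "engine", "crawler"],
        ["recipe", "cooking", "pasta"], ["football", "match", "score"]]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

plain = WeightedNaiveBayes(anchor_weight=1.0)
plain.fit(docs, labels)
weighted = WeightedNaiveBayes(anchor_weight=2.0)
weighted.fit(docs, labels)

anchor, body = ["crawler"], ["pasta", "recipe"]
print(plain.predict(anchor, body), weighted.predict(anchor, body))
```

On this toy data, the unweighted model is outvoted by the two off-topic body words, while doubling the anchor weight lets the on-topic anchor word decide the label.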

Keywords/Search Tags:

vertical search engine, focused crawler, the topic boundary around an unvisited link, naive Bayes classification algorithm

Related items

1	Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine
2	The Design And Research Of Topic Web Crawler In Vertical Search Engine
3	The Research On Focused Crawling Algorithm In Vertical Search Engine
4	Research And Implementation On Focused Crawler With New Strategy For The Vertical Search Engine
5	Design And Implementation Of Vertical Search Engine Based On Web Crawler
6	The Optimization And Achieve For Focused Crawling Algorithm Based On The Website Content Framework
7	Customizable Focused Crawler
8	Research And Implementation On Key Techniques Of Topic Search Engine
9	Research And Application Of Vertical Search Engine Key Technologies Based On The Lucene
10	Research On Focused Crawler Technology Of Vertical Search Engine