
Research On An Algorithm Of Focused Crawler In Vertical Search Engine

Posted on: 2016-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
Full Text: PDF
GTID: 2298330470450508
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, Internet resources are increasing exponentially. Exhaustive crawlers, which aim to search all Internet resources, no longer meet the precise needs of users in different fields, and so vertical search engines arose.

The focused crawler is the core of a vertical search engine; its crawling quality and efficiency directly determine the performance of the vertical search engine. In contrast with exhaustive crawlers, focused crawlers aim to retrieve the pages that are relevant to specific topics while filtering out irrelevant documents on the web; they are characterized as professional, accurate, and in-depth. Traditional focused crawlers use the whole page's content to predict the relevance of an unvisited link. An evaluation based on the whole page's content may not be accurate, because a webpage usually contains multiple topics and not all of them are related to the given topic.

In this paper, we study the topic relevance algorithms and search strategies of focused crawlers and, to address this disadvantage of traditional focused crawlers, propose a focused crawler based on the topic boundary around an unvisited link. The main research work is as follows:

First, the topic boundary around the unvisited link is identified by drawing 2-D coordinates and combining them with the characteristics of the Dewey Decimal Classification (DDC). DDC is a hierarchical classification method; each specific topic corresponds to one or more classification numbers because of polysemy. DDC numbers can determine whether two words are the same or belong to related topics. The topic boundary around an unvisited link is a set of keywords having similar or the same thematic meaning, and mainly includes anchor text and body text.
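The idea of a DDC-based topic boundary can be illustrated with a minimal sketch. The abstract does not give the exact procedure (in particular the 2-D coordinate step), so everything here is an assumption made for illustration: the toy keyword-to-DDC table `TOY_DDC`, the rule that two words are related when their DDC numbers share a three-character prefix, and the function names `related` and `topic_boundary`.

```python
# Toy keyword -> DDC numbers table. The entries are illustrative assumptions,
# not an authoritative DDC assignment. A polysemous word maps to several numbers.
TOY_DDC = {
    "crawler": {"025.04"},
    "search": {"025.04"},
    "index": {"025.04", "332.6"},
    "engine": {"025.04", "621.4"},
    "recipe": {"641.5"},
    "football": {"796.3"},
}


def related(word_a, word_b, table=TOY_DDC, prefix_len=3):
    """Treat two words as related if any of their DDC numbers share a prefix.

    The prefix rule is an assumption; words missing from the table are
    never related to anything.
    """
    prefixes_a = {number[:prefix_len] for number in table.get(word_a, ())}
    prefixes_b = {number[:prefix_len] for number in table.get(word_b, ())}
    return bool(prefixes_a & prefixes_b)


def topic_boundary(anchor_words, body_words, table=TOY_DDC):
    """Return the anchor words plus the body words related to any anchor word."""
    boundary = list(anchor_words)
    for word in body_words:
        if word in boundary:
            continue
        if any(related(word, anchor, table) for anchor in anchor_words):
            boundary.append(word)
    return boundary


if __name__ == "__main__":
    anchor = ["search", "engine"]
    body = ["crawler", "recipe", "index", "football"]
    print(topic_boundary(anchor, body))
```

In this sketch, body words with no DDC class in common with the anchor text (here "recipe" and "football") are left out of the boundary, which is the noise-filtering effect the abstract describes.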
This focused crawler takes into account the anchor text and the body text whose meaning is similar to that of the anchor text when calculating the relevance of an unvisited link, which avoids the impact of noise on the outcome.

Furthermore, a Naive Bayes text classifier is built to analyze the topic boundary around an unvisited link and guide the focused crawler's crawl. To date, the Naive Bayes classification algorithm is one of the most effective algorithms in text categorization. Anchor text is more representative of the thematic meaning of an unvisited link, so it is given a higher weight to highlight the importance of anchor text keywords in the judgment.

Last, precision and analogic recall are used as the performance metrics to compare the focused crawler proposed in this paper with other crawler algorithms in terms of crawling quality. We collect and analyze statistics on the outcomes; the experimental results show that the focused crawler proposed in this paper is more effective in improving crawling quality.
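The weighting idea above can be sketched as a small multinomial Naive Bayes classifier in which anchor-text words count more than body-text words. This is a sketch under stated assumptions, not the thesis's implementation: the class name `WeightedNaiveBayes`, the `anchor_weight` parameter, the Laplace smoothing, and the `precision` helper (fraction of crawled pages judged relevant) are all hypothetical choices, and analogic recall is not shown because the abstract does not define it.

```python
import math
from collections import Counter


class WeightedNaiveBayes:
    """Multinomial Naive Bayes where anchor-text words get a higher weight."""

    def __init__(self, anchor_weight=2.0):
        self.anchor_weight = anchor_weight  # assumed weighting scheme
        self.word_counts = {}               # label -> Counter of word counts
        self.doc_counts = Counter()         # label -> number of training docs
        self.vocab = set()

    def fit(self, documents):
        """Train on an iterable of (list_of_words, label) pairs."""
        for words, label in documents:
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocab.update(words)

    def log_scores(self, anchor_words, body_words):
        """Return label -> log score for one link's topic boundary."""
        weighted = Counter()
        for word in anchor_words:
            weighted[word] += self.anchor_weight
        for word in body_words:
            weighted[word] += 1.0
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word, weight in weighted.items():
                # Laplace (add-one) smoothing avoids log(0) for unseen words.
                prob = (counts[word] + 1) / (total_words + len(self.vocab))
                score += weight * math.log(prob)
            scores[label] = score
        return scores

    def predict(self, anchor_words, body_words):
        scores = self.log_scores(anchor_words, body_words)
        return max(scores, key=scores.get)


def precision(relevance_flags):
    """Fraction of crawled pages flagged relevant (0.0 for an empty crawl)."""
    flags = list(relevance_flags)
    return sum(1 for flag in flags if flag) / len(flags) if flags else 0.0


if __name__ == "__main__":
    training = [
        (["crawler", "search", "engine"], "relevant"),
        (["search", "index", "crawler"], "relevant"),
        (["recipe", "cooking", "football"], "irrelevant"),
        (["football", "match", "recipe"], "irrelevant"),
    ]
    model = WeightedNaiveBayes(anchor_weight=3.0)
    model.fit(training)
    print(model.predict(["search", "crawler"], ["football"]))
    print(precision([True, True, False, True]))
```

A crawler built on this sketch would score each unvisited link from its topic boundary and follow the links predicted relevant first; how large `anchor_weight` should be is a tuning question that the abstract leaves open.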
Keywords/Search Tags: vertical search engine, focused crawler, topic boundary around an unvisited link, Naive Bayes classification algorithm