
Research On A Method Of Focused Crawler Based On Page Partition

Posted on: 2018-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhou
Full Text: PDF
GTID: 2348330518463182
Subject: Computer application technology
Abstract/Summary:
As Web information diversifies and its volume grows ever faster, storage costs rise and collecting information becomes increasingly difficult. General-purpose crawlers consume a great deal of network bandwidth while they run, which wastes system resources. They also pay little attention to whether the pages they fetch match the user's search topic, so they often return many pages the user is not interested in. Vertical search engines built around a focused (topic) crawler emerged to improve crawl efficiency and the user experience. A focused crawler applies a heuristic search strategy while crawling: it computes the relevance of each page to the user's search topic, filters out unrelated pages, and adds only topic-related pages to the queue of pages to visit. Online information is rich and varied, so the key technical challenges are how to acquire and integrate topical content effectively and how to make the crawler download the key pages related to the topic as completely as possible.

Building on existing results in focused crawler technology, this thesis studies web page partitioning and search strategies for candidate links. It partitions pages into blocks using tag information and visual information, and proposes a search algorithm that adds a link-block factor to the scoring of candidate topic links.

(1) Page partitioning based on tag information and visual information. The method applies the layout rules of the <table> and <div> tags, combined with visual information from CSS style sheets or <style> properties. Blocks are first classified according to web page design rules into three categories: text blocks, link blocks, and blocks unrelated to the topic. Topic text blocks are then extracted in two steps: blocks are filtered by tag attribute values, and the remaining blocks are filtered further by computing their similarity to a benchmark, which yields the final text that satisfies the conditions. Topic link blocks are extracted with matching rules that filter out noise links and keep the links that fit the topic. This partitioning method is easy to implement in practice, avoids extensive blind matching between blocks, and has lower time and space complexity.

(2) While crawling, a focused crawler must compute a weight for each link and insert the link into the waiting queue. This thesis introduces a link-block weight into the Shark-Search algorithm and proposes an improved search strategy based on link-block priority prediction. With the link-block weight, the anchor text of all child links becomes the main factor in link relevance. The resulting search strategy greatly improves accuracy and recall.

(3) To verify the effectiveness of the system, the HITS algorithm, the Shark-Search algorithm, and the proposed algorithm are each implemented and run under different thresholds, and the results of the three algorithms are compared and analyzed. The experimental data show that the proposed system outperforms the other two algorithms under multiple threshold settings. The total recall and the total amount of information obtained by the three algorithms are then compared in detail, and the topic drift rate for semantic definitions and abstract concepts is analyzed experimentally. The results show that the improved system performs better than a traditional focused crawler.
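
The abstract does not give the scoring formula, so the following is only a minimal sketch of how a link-block weight might be folded into a Shark-Search style link score. The inherited-score rule (a relevant parent passes on a decayed copy of its own relevance, an irrelevant parent passes on a decayed copy of what it inherited) follows the commonly described Shark-Search scheme. Everything else is an assumption made for illustration: the function names, the parameters delta, gamma and lam, the bag-of-words cosine similarity, the way the block weight is mixed with the anchor score, and the example.com URLs.

```python
import heapq
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity of two texts using raw term counts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def link_score(topic, parent_text, parent_inherited, anchor, block_anchors,
               delta=0.5, gamma=0.5, lam=0.4):
    """Score one candidate link (all parameter values are illustrative)."""
    parent_sim = cosine(topic, parent_text)
    # Shark-Search style inheritance from the parent page.
    if parent_sim > 0:
        inherited = delta * parent_sim
    else:
        inherited = delta * parent_inherited
    anchor_sim = cosine(topic, anchor)
    # Assumed link-block weight: relevance of all anchor texts in the block.
    block_weight = cosine(topic, " ".join(block_anchors))
    # Assumed mix of the link's own anchor score and its block's weight.
    neighbourhood = (1 - lam) * anchor_sim + lam * block_weight
    return gamma * inherited + (1 - gamma) * neighbourhood


class Frontier:
    """Waiting queue that returns the highest-scoring URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


# Hypothetical page with one navigation block and one topic link block.
topic = "focused web crawler"
parent = "A survey of focused crawler design for the web"
nav_block = ["Home", "Login", "Contact us"]
topic_block = ["Focused crawler tutorial", "Web crawler frontier design"]

frontier = Frontier()
frontier.push("http://example.com/login",
              link_score(topic, parent, 0.0, "Login", nav_block))
frontier.push("http://example.com/tutorial",
              link_score(topic, parent, 0.0, "Focused crawler tutorial", topic_block))
```

In this sketch, a link sitting in a block whose anchors are mostly on topic gets a boost even when its own anchor text is short, while links in navigation blocks fall toward the back of the queue. A real system would replace the term-count similarity and the fixed weights with whatever the thesis actually specifies.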
Keywords/Search Tags: Page partition, Visual information, Tag attributes, Block link topic, Shark-Search algorithm