
Research On Focused Crawler Based On Page Segmentation

Posted on: 2017-04-03 | Degree: Master | Type: Thesis
Country: China | Candidate: J M Zhang | Full Text: PDF
GTID: 2348330503989898 | Subject: Computer application technology
Abstract/Summary:
With the rapid growth of information on the Internet, general-purpose search engines cannot satisfy users from different backgrounds who need high recall in their query results. Vertical search engines focus on specific topics and can crawl and retrieve topic-related pages more comprehensively. They are now widely used in many fields, and the focused crawler, one of the core components of a vertical search engine, has become an active research area in recent years.

A focused crawler needs to crawl pages relevant to its topics, so relevance calculation and prediction is its core problem. This problem has three main aspects: page analysis, page relevance computation, and link priority computation. For page analysis, a new page content extraction method based on page region segmentation is proposed. The method uses repeated tag patterns to split a page into several records, filters noise nodes by their format features, and recognizes the content record from the position of the text headline and the length of the content in each record. For page relevance computation, the thesis proposes a voting algorithm that combines the results of a URL classifier and a text classifier. For link priority computation, a classifier is used to filter off-topic pages and off-topic regions, and link structure information is then combined with the classifier results to compute link priority.

Experimental results show that the text extraction method based on page record detection accurately extracts the main text of a page and outperforms one state-of-the-art approach. The voting algorithm that combines the two classifiers can better classify pages of different topics. The link priority computation method, which combines link structure features with page region features, improves the harvest ratio of the focused crawler. As future work, we will try to train the classifier online and to speed up classification.
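The relevance voting and link priority steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: the weighted soft vote in `vote`, the `url_weight` and `threshold` values, the simple averaging in `link_priority`, and the `Frontier` class are hypothetical and are not the thesis's method.

```python
import heapq
from itertools import count


def vote(url_prob, text_prob, url_weight=0.3):
    """Combine two classifier outputs with a weighted soft vote.

    url_prob and text_prob are on-topic probabilities in [0, 1].
    The weighting scheme and the 0.3 default are assumptions.
    """
    return url_weight * url_prob + (1.0 - url_weight) * text_prob


def link_priority(page_score, region_score, structure_score, threshold=0.5):
    """Return a priority in [0, 1], or None if the link should be dropped.

    Links from off-topic pages or off-topic regions are filtered first;
    the remaining links get the mean of the three scores (an assumed
    combination rule, chosen only for simplicity).
    """
    if page_score < threshold or region_score < threshold:
        return None
    return (page_score + region_score + structure_score) / 3.0


class Frontier:
    """Crawl frontier that always yields the highest-priority URL next."""

    def __init__(self):
        self._heap = []
        self._tie = count()  # insertion counter breaks ties between equal priorities

    def push(self, url, priority):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a crawler loop under these assumptions, each extracted link would be scored with `link_priority`, skipped when the result is `None`, and otherwise pushed onto the `Frontier`, so that the next page fetched is always the one currently predicted to be most relevant.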
Keywords/Search Tags: Focused crawler, Page region segmentation, Page classification, Link priority prediction