
Research And Implementation On Focused Crawler With New Strategy For The Vertical Search Engine

Posted on: 2016-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: N Z Wei
GTID: 2298330467991759
Subject: Computer technology
Abstract/Summary:
More and more organizations and individuals use vertical search engines to collect and search information in particular fields. Current vertical search engines, however, have several shortcomings. Most either scrape only pre-selected web pages, as tools such as pyspider do, or crawl inefficiently, following the links between pages and only afterwards determining which pages are related to the topic. Their revisit strategies cannot keep the collected information timely. Most do not account for the fact that the characteristics of a topic change over time. And the crawlers of small organizations may be refused by some sites.
This paper focuses on data collection and data classification for the vertical search engine. It designs a new topic-based data collection and data integration model, proposes an efficient vertical search engine architecture, and implements a new vertical search system. The main contributions of this paper are as follows:
1. The crawling strategy is the key to the crawler system. This paper constructs a tree that stores URLs according to their structure. Based on the transitive relationship between web links, URLs of different types are assigned different predicted relevance values, which helps reduce the fetching of unnecessary pages.
2. This paper proposes another method for finding index pages on the Internet, that is, pages that contain many links to other pages and lead users to new or important pages. The system revisits the index pages regularly, instead of revisiting all web pages, to find the newest information.
3. Determining page relevance is the most effective approach to page classification, which is the key to a vertical search engine. This paper builds an SVM classifier with a feedback mechanism so that the representation of the topic does not go out of date, which helps the crawler collect more pages that carry valuable information about the topic.
4. This paper designs and implements a new distributed focused crawler system based on a message queue. The message queue reduces the coupling between components, and users can easily expand the system by adding servers or tasks.
Finally, the paper implements the system and tests the efficiency of the crawler and of the classification algorithm in the new system.
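The abstract does not give the details of the URL tree in contribution 1, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that URLs are split into a host plus path segments, that each node counts how many classified pages beneath it were relevant, and that an unvisited URL inherits a smoothed score from its deepest known ancestor. The names `UrlNode`, `UrlTree`, `record`, and `predict`, and the smoothing rule itself, are hypothetical and not taken from the thesis.

```python
from urllib.parse import urlparse


class UrlNode:
    """One node of the URL tree: a host or a single path segment."""

    def __init__(self, label):
        self.label = label
        self.children = {}
        self.relevant = 0  # relevant pages recorded at or below this node
        self.visited = 0   # all classified pages recorded at or below this node

    def score(self, prior=0.5, weight=1.0):
        # Smoothed fraction of relevant pages. The prior and its weight are
        # illustrative assumptions, not values from the thesis.
        return (self.relevant + prior * weight) / (self.visited + weight)


class UrlTree:
    """Stores URLs by structure: host first, then one level per path segment."""

    def __init__(self):
        self.root = UrlNode("")

    @staticmethod
    def _parts(url):
        parsed = urlparse(url)
        segments = [s for s in parsed.path.split("/") if s]
        return [parsed.netloc] + segments

    def record(self, url, is_relevant):
        """Store a classified URL and update the counts along its path."""
        node = self.root
        node.visited += 1
        node.relevant += int(is_relevant)
        for part in self._parts(url):
            node = node.children.setdefault(part, UrlNode(part))
            node.visited += 1
            node.relevant += int(is_relevant)

    def predict(self, url):
        """Predicted relevance of a URL: the score of its deepest known ancestor."""
        node = self.root
        for part in self._parts(url):
            child = node.children.get(part)
            if child is None:
                break
            node = child
        return node.score()
```

In a crawler built this way, the frontier could be ordered by `predict(url)` so that URLs under directories that have mostly produced relevant pages are fetched first, and URLs under unproductive directories are postponed or skipped.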
Keywords/Search Tags: Vertical search engine, Focused crawler, Information gathering, Web page classification