Research And Implementation On Focused Crawler With New Strategy For The Vertical Search Engine

Posted on:2016-01-17

Degree:Master

Type:Thesis

Country:China

Candidate:N Z Wei

Full Text:PDF

GTID:2298330467991759

Subject:Computer technology

Abstract/Summary:


More and more organizations and individuals use vertical search engines to collect and search information in particular fields. Most current vertical search engines, however, have several shortcomings. Some, such as pyspider, only scrape pre-selected web pages. Others crawl inefficiently: they collect pages by following the links among them and only afterwards determine which pages are related to the topic. Their revisit strategies cannot keep the information timely. Most do not account for the fact that the characteristics of a topic change over time. And crawlers run by small organizations may be refused by some sites.

This paper focuses on data collection and data classification for the vertical search engine. It designs a new topic-based data collection and data integration model, proposes an efficient vertical search engine architecture, and implements a new vertical search system. The main contributions of this paper are as follows:

1. The crawling strategy is the key to the crawler system. This paper constructs a tree that stores URLs according to their structure. Based on the transitive relationship between web links, URLs of different types are assigned different predicted relevance values, which helps reduce the fetching of unnecessary pages.
2. This paper proposes a method for finding index pages on the Internet, that is, pages that contain many links to other pages and lead users to new or important pages. The system revisits the index pages regularly, instead of revisiting all web pages, to find the newest information.
3. Determining page relevance is the most effective approach to page classification, which is the key to a vertical search engine. In this paper, an SVM algorithm with a feedback mechanism ensures that the representation of the topic does not go out of date, which helps the crawler fetch more pages that carry valuable information about the topic.
4. This paper designs and implements a new distributed focused crawler system based on a message queue. The message queue reduces the coupling between components, and users can easily expand the system by adding servers or tasks.

Finally, the paper implements the system and tests the efficiency of the crawler and of the classification algorithm in the new system.
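
The abstract does not give the details of the URL tree, so the following is only a minimal sketch of the general idea in the first contribution: URLs are stored in a tree keyed by host and path segment, and an unseen URL receives a predicted relevance (correlation) value from its deepest known ancestor. The class names `UrlNode` and `UrlTree`, the smoothed-ratio scoring rule, and the `prior` parameter are all assumptions made for illustration, not the thesis's actual method.

```python
from urllib.parse import urlsplit


class UrlNode:
    """One host or path segment in the URL tree (hypothetical structure)."""

    def __init__(self, segment):
        self.segment = segment
        self.children = {}
        self.relevant = 0  # classified pages under this node judged on-topic
        self.visited = 0   # classified pages under this node

    def score(self, prior=0.5):
        # Smoothed share of relevant pages. `prior` is an assumed default
        # that a node with no observations falls back to.
        return (self.relevant + prior) / (self.visited + 1)


class UrlTree:
    """Stores classified URLs by structure and predicts relevance of new ones."""

    def __init__(self):
        self.root = UrlNode("")

    @staticmethod
    def _segments(url):
        parts = urlsplit(url)
        return [parts.netloc] + [s for s in parts.path.split("/") if s]

    def record(self, url, is_relevant):
        """Insert a classified URL and update the counts along its path."""
        node = self.root
        for seg in self._segments(url):
            node = node.children.setdefault(seg, UrlNode(seg))
            node.visited += 1
            node.relevant += int(is_relevant)

    def predict(self, url):
        """Return the score of the deepest ancestor already in the tree."""
        node = self.root
        for seg in self._segments(url):
            child = node.children.get(seg)
            if child is None:
                break
            node = child
        return node.score()


# Usage with made-up URLs: pages under /news/ were on-topic, /ads/ was not.
tree = UrlTree()
tree.record("http://example.com/news/a.html", True)
tree.record("http://example.com/news/b.html", True)
tree.record("http://example.com/ads/x.html", False)
print(tree.predict("http://example.com/news/c.html"))
print(tree.predict("http://example.com/ads/y.html"))
```

In a crawler built along these lines, the frontier could be ordered by `predict` so that URLs under branches with a poor track record are fetched late or skipped, which is one way to reduce the fetching of unnecessary pages.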

Keywords/Search Tags:

Vertical search engine, Focused crawler, Information gathering, Web page classification


Related items

1	Research Of Main Technologies Of Vertical Search Engine
2	Research On An Algorithm Of Focused Crawler In Vertical Search Engine
3	The Research On Focused Crawling Algorithm In Vertical Search Engine
4	The Optimization And Achieve For Focused Crawling Algorithm Based On The Website Content Framework
5	Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine
6	Research On Focused Crawler Technology
7	Research On Focused Crawler Technology Of Vertical Search Engine
8	A Vertical Search Engine In The Field Of News
9	Research On A Method Of Focused Crawler For Vertical Search System
10	Research And Design Of Vertical Search Engine Web Crawler