Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine

Posted on:2015-02-23

Degree:Master

Type:Thesis

Country:China

Candidate:H Chen

Full Text:PDF

GTID:2268330428467673

Subject:Computer technology

Abstract/Summary:

With the rapid development of the Internet, the volume of network information resources has become extremely large, and it is increasingly difficult to find information quickly and accurately within it. Search engines emerged to meet this need. They make searching far more convenient for users, so they are widely used in people's daily lives. The web crawler is the core module of a search engine and is responsible for collecting web pages of all kinds from the network. Its crawl strategy and performance greatly influence the service quality of a search engine, so the web crawler is worthy of research and improvement. Because of the huge scale of the network and the need to respond to users quickly, general search engines often return inaccurate results that fail to satisfy users. The vertical search engine is a new generation of search engine that can provide a more detailed and accurate search service. The research object of this paper is the focused crawler in the vertical search engine. A focused crawler concentrates on collecting information in a specific domain, so it has a higher acquisition efficiency. The focused crawler has high research and practical value, and it offers a new direction for the development of web crawlers.

In this paper, we first outline the development of search engines and the state of research on web crawlers, study the basic principles and working process of search engines, and then discuss in depth the key technologies of focused crawlers. Finally, based on these theories, the paper gives an engineering implementation of a focused crawler system.

For the crawl strategy of the focused crawler system, the paper draws on the algorithmic process of the Fish-Search and Shark-Search algorithms. Building on them, the paper dynamically adjusts the topic relevancy threshold to overcome the "tunnel" problem between groups of on-topic web pages. At the same time, the paper draws on a mature text analysis technique, the TF-IDF algorithm in the Vector Space Model, and designs an improved method to calculate web page topic relevancy and URL topic relevancy. For web page text extraction, the paper uses the tag tree structure of the web page to calculate the text-to-tag density and then extracts the text of the page. Experiments showed that, compared with a focused crawler implemented in the traditional way, the focused crawler implemented in this paper had a slightly lower harvest rate but achieved a higher coverage rate, striking a good balance between the two.
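
As an illustration of the Vector Space Model idea mentioned above, the following is a minimal sketch of scoring a page against a topic description using plain TF-IDF weights and cosine similarity. It is not the improved method designed in the thesis, whose details are not given in this abstract. The function names (`tokenize`, `build_idf`, `tfidf_vector`, `relevancy`), the smoothed IDF formula, and the tiny sample corpus are all assumptions made for the example.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    # Naive tokenizer (assumption): lowercase, split on whitespace, strip punctuation.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def build_idf(corpus):
    # Smoothed inverse document frequency over a reference corpus.
    # Terms never seen in the corpus get the weight of a term with df = 0.
    docs = [tokenize(d) for d in corpus]
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    table = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    default = math.log(1 + n) + 1.0
    return lambda term: table.get(term, default)

def tfidf_vector(text, idf):
    # Sparse TF-IDF vector: term -> (relative term frequency) * idf.
    tokens = tokenize(text)
    total = len(tokens) or 1
    return {t: (c / total) * idf(t) for t, c in Counter(tokens).items()}

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def relevancy(topic_text, page_text, idf):
    # Topic relevancy of a page (or of URL anchor text) as a cosine score in [0, 1].
    return cosine(tfidf_vector(topic_text, idf), tfidf_vector(page_text, idf))

# Hypothetical sample data, used only to exercise the functions.
corpus = [
    "focused crawler collects topic web pages",
    "search engine indexes web pages",
    "football match results and scores",
]
idf = build_idf(corpus)
topic = "focused crawler topic web pages"
on_topic = "a focused crawler downloads web pages on one topic"
off_topic = "the football match ended with high scores"

print(relevancy(topic, on_topic, idf))
print(relevancy(topic, off_topic, idf))
```

In a focused crawler, a score like this would typically be compared against a relevancy threshold to decide whether a page is kept and whether its outgoing links are followed. The same function could be applied to link anchor text to estimate URL relevancy, again as an assumption about how such a score might be used.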

Keywords/Search Tags:

Vertical search engine, Focused crawler, Topic relevancy, Crawler strategy, Text extraction

Related items

1	Research On An Algorithm Of Focused Crawler In Vertical Search Engine
2	The Research On Focused Crawling Algorithm In Vertical Search Engine
3	Customizable Focused Crawler
4	Research And Application Of Vertical Search Engine Key Technologies Based On The Lucene
5	Research And Implementation On Focused Crawler With New Strategy For The Vertical Search Engine
6	Research Of Main Technologies Of Vertical Search Engine
7	Technology Research, Based On Focused Crawling Of Web Information Collection
8	Vertical Search Engine Based Public Opinion Alert And Analysis Platform
9	Research On Focused Crawler Technology Of Vertical Search Engine
10	A Vertical Search Engine In The Field Of News