Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine

Posted on: 2015-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
Full Text: PDF
GTID: 2268330428467673
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the volume of information available online has become extremely large, and finding information quickly and accurately within it is increasingly difficult. Search engines emerged to meet this need, and because they make searching far more convenient, they are now widely used in daily life. The web crawler is the core module of a search engine and is responsible for collecting web pages from across the network. Its crawl strategy and performance strongly influence the service quality of the search engine, so the web crawler merits research and improvement. Because general search engines must cover a huge network while responding to users promptly, they often return inaccurate results that fail to satisfy users. The vertical search engine is a new generation of search engine that provides more detailed and accurate search service. The research object of this thesis is the focused crawler in the vertical search engine. A focused crawler concentrates on collecting information in a specific area, which gives it higher acquisition efficiency. It has high research and practical value and offers a new direction for the development of web crawlers.

In this thesis, we first outline the development of search engines and the state of research on web crawlers, study the basic principles and workflow of search engines, and then discuss the key technologies of focused crawlers in depth. Finally, based on this theory, the thesis presents an engineering implementation of a focused crawler system.

For the crawl strategy of the focused crawler system, the thesis draws on the procedures of the Fish-Search and Shark-Search algorithms.
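As a rough illustration of the idea behind Fish-Search and Shark-Search, the sketch below shows how a link can inherit a decayed relevance score from the page it was found on and blend it with the relevance of its own anchor text. It is a simplified reading of the published Shark-Search scheme (the anchor-context term is omitted), not the implementation from the thesis. The decay `delta`, the blend weight `gamma`, the toy `similarity` function, and the `Frontier` class are all assumptions chosen for illustration.

```python
import heapq

def similarity(topic_terms, text):
    """Toy relevance in [0, 1]: fraction of topic terms that occur in the text."""
    if not topic_terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for term in topic_terms if term in words) / len(topic_terms)

def inherited_score(page_relevance, parent_inherited, delta=0.5):
    """Score a page passes to its out-links.

    A relevant page passes on its own relevance, decayed by delta.
    An irrelevant page passes on the decayed score it inherited itself,
    so the signal fades gradually along a chain of off-topic pages.
    """
    if page_relevance > 0:
        return delta * page_relevance
    return delta * parent_inherited

def potential_score(inherited, anchor_relevance, gamma=0.8):
    """Blend the inherited score with the link's own anchor-text relevance."""
    return gamma * inherited + (1 - gamma) * anchor_relevance

class Frontier:
    """Priority queue of unvisited URLs, highest potential score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In a crawl loop, each fetched page would be scored with `similarity`, its out-links pushed with `potential_score`, and the next page taken with `pop`. Because an irrelevant page still passes on a decayed score, links a few hops beyond an off-topic page remain reachable for a while instead of being cut off immediately.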
Building on these two algorithms, the thesis dynamically adjusts the topic relevancy threshold so that the crawler can pass through the "tunnels" between groups of on-topic web pages. It also draws on a mature text analysis technique, the TF-IDF algorithm in the Vector Space Model, and designs an improved method for calculating web page topic relevancy and URL topic relevancy. For web page text extraction, the thesis uses the tag tree structure of the page to compute the text-to-tag density and then extracts the page's main text. Experiments showed that, compared with a focused crawler implemented in the traditional way, the focused crawler implemented in this thesis had a slightly lower harvest rate but achieved a higher coverage rate, striking a good balance between the two.
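The abstract does not spell out the improved relevancy formula, so the following is only a minimal sketch of the standard baseline it starts from: TF-IDF weighting in the Vector Space Model with cosine similarity between a topic description and a page. The function names, the smoothed IDF variant, and the sample texts are assumptions made for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_idf(corpus):
    """Return (idf_by_term, idf_for_unseen_terms) using a smoothed IDF."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))  # count each term once per document
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return idf, math.log(1 + n) + 1.0

def tfidf_vector(text, idf, unseen_idf):
    """Sparse TF-IDF vector as a dict: term -> weight."""
    tokens = tokenize(text)
    tf = Counter(tokens)
    total = len(tokens) or 1
    return {t: (c / total) * idf.get(t, unseen_idf) for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def relevance(topic_text, page_text, idf, unseen_idf):
    """Topic relevancy of a page: cosine of the two TF-IDF vectors."""
    return cosine(tfidf_vector(topic_text, idf, unseen_idf),
                  tfidf_vector(page_text, idf, unseen_idf))

# Hypothetical sample data for illustration only.
corpus = [
    "focused crawler for vertical search engine",
    "cooking recipes and kitchen tips",
    "web crawler collects web pages",
]
idf, unseen_idf = build_idf(corpus)
topic = "focused crawler vertical search"
on_topic = relevance(topic, "a focused crawler collects pages for a vertical search engine", idf, unseen_idf)
off_topic = relevance(topic, "kitchen tips and cooking recipes", idf, unseen_idf)
```

A crawler would keep a page when its score reaches the current relevancy threshold, and the same `relevance` function could score a URL by passing its anchor text instead of the page body. Lowering the threshold temporarily is one way to let the crawler cross a stretch of low-scoring pages, which is the "tunnel" situation the abstract describes.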
Keywords/Search Tags: Vertical search engine, Focused crawler, Topic relevancy, Crawler strategy, Text extraction