
The Topical Web Crawler Research In Vertical Search Engine

Posted on: 2014-02-28 | Degree: Master | Type: Thesis
Country: China | Candidate: J M Liu | Full Text: PDF
GTID: 2248330398457652 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the amount of information on the web is growing explosively. General-purpose search engines cannot meet people's demand for personalized information, and domain-specific vertical search engines have emerged in response. A vertical search engine serves users in a specific domain and can provide them with high-quality, personalized information. The topical crawler is an important component of a vertical search engine: it crawls the web according to a user-defined topic, collects only the resources relevant to that topic, filters out irrelevant ones, and supplies the data for the vertical search engine. This paper analyzes the system architecture and technical principles of topical crawlers, focusing on topic description, topic relevance measurement, and topic search algorithms. The main work of this paper is as follows:

1) The common keyword-based topic description model is not accurate or comprehensive enough. To address this, the paper presents a strategy for dynamically expanding the description keywords. First, a basic set of keywords is built. Second, the paper proposes a location-based TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, in which words are weighted according to their position in the page text, and uses this improved algorithm to extract features from a page. Finally, while the topical crawler is crawling the web, the similarity between the page and the topic is combined with word match frequencies to add feature words to the topic library. As a result, the keyword library that describes the topic becomes more comprehensive and accurate.

2) The paper analyzes the shark search algorithm and improves it to address its deficiencies. Shark search uses the context of a link, but that context is often filled with noise, which interferes with link prediction. The paper uses the URL string itself instead of the link context, because the URL string can represent the content of the page it points to. Heuristics derived from the structure of the URL are used to translate the URL string into recognizable text, and the similarity between that text and the topic is then calculated. Because shark search is greedy, it has difficulty finding a globally optimal solution in the web graph. The paper introduces tunneling analysis techniques to counter this greediness.

The paper combines the dynamic keyword expansion algorithm with the improved shark search algorithm to improve the performance of the topical crawler. Experiments show that the precision and recall of the topical crawler increase, which indicates that the proposed method is effective.
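
As an illustration of the location-based TF-IDF idea in item 1, the following is a minimal Python sketch, not the author's implementation. The position names (`title`, `heading`, `body`), the weight values, the smoothed IDF formula, and all function names are assumptions made for the example; the abstract does not specify any of them.

```python
import math
import re
from collections import Counter

# Hypothetical position weights: the abstract gives no actual values.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_tf(page):
    """Return position-weighted term frequencies, normalized to sum to 1.

    `page` maps a position name (e.g. "title") to the text found there.
    """
    counts = Counter()
    for position, text in page.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for term in tokenize(text):
            counts[term] += weight
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def idf(pages):
    """Smoothed inverse document frequency over a list of pages (assumed formula)."""
    n = len(pages)
    df = Counter()
    for page in pages:
        terms = set()
        for text in page.values():
            terms.update(tokenize(text))
        df.update(terms)
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def top_features(page, idf_table, k=3):
    """Return the k highest-scoring terms of a page (ties broken alphabetically)."""
    tf = weighted_tf(page)
    scores = {t: tf[t] * idf_table.get(t, 1.0) for t in tf}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

In this sketch a term that appears in the title counts three times as much as the same term in the body, so title words tend to rank higher among the extracted features. Candidate feature words selected this way could then be filtered by page-topic similarity before being added to the keyword library, as the abstract describes.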
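
The URL-based link scoring in item 2 can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the shark search algorithm itself and not the paper's method: the noise-token list, the cosine scoring, the threshold, the miss budget, and the names `url_to_terms`, `cosine`, and `tunnel_step` are all hypothetical. Tunneling is modeled here simply as allowing a limited number of consecutive low-scoring links before a path is abandoned.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit, unquote

# Hypothetical list of URL tokens that carry no topical information.
NOISE = {"www", "http", "https", "com", "net", "org", "html", "htm", "php", "index"}


def url_to_terms(url):
    """Turn a URL string into a list of candidate words using its structure."""
    parts = urlsplit(url)
    text = unquote(" ".join([parts.netloc, parts.path, parts.query]))
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in NOISE and len(t) > 1]


def cosine(terms, topic_weights):
    """Cosine similarity between URL terms and a weighted topic keyword set."""
    counts = Counter(terms)
    dot = sum(counts[t] * w for t, w in topic_weights.items())
    norm_a = math.sqrt(sum(c * c for c in counts.values()))
    norm_b = math.sqrt(sum(w * w for w in topic_weights.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tunnel_step(parent_misses, score, threshold=0.2, max_misses=2):
    """Return the child's consecutive-miss count, or None to drop the link.

    A relevant link resets the count; an irrelevant one is still followed
    until the (assumed) budget of max_misses is exhausted.
    """
    if score >= threshold:
        return 0
    misses = parent_misses + 1
    return misses if misses <= max_misses else None
```

For example, `url_to_terms("http://www.example.com/search-engine/topical_crawler.html?page=2")` yields the words `example`, `search`, `engine`, `topical`, `crawler`, and `page`, which can then be scored against the topic keywords. The miss budget lets the crawler pass through a few irrelevant pages that may lead to relevant ones, which a purely greedy strategy would discard.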
Keywords/Search Tags: topical crawler, vertical search engine, topic description, topic prediction