
Research On Focused Crawler Based On Improved Shark-Search Algorithm

Posted on: 2020-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Xu
Full Text: PDF
GTID: 2518305732476814
Subject: Computer technology
Abstract/Summary:
Information overload on the Internet makes it hard for people to find content of interest. Traditional search engines meet retrieval needs to a certain extent, but they also suffer from problems such as information mismatch and irrelevant search results. For this reason, vertical search engines have been developed to meet users' more refined search needs in specific domains. The focused crawler is a core component of a vertical search engine: it collects domain-specific web data for the engine. This paper takes the focused crawler as its main research subject.

A focused crawler crawls web pages according to a predefined topic, filters out pages unrelated to the topic during crawling, and uses a specific search strategy to predict the order and route of crawling, so as to reduce irrelevant visits and fetch more relevant pages. Shark-Search is a heuristic focused-crawling algorithm based on web content, widely used because it is simple to implement and crawls efficiently. However, Shark-Search also has shortcomings, such as low accuracy in topic discrimination and the "myopia problem" and "tunnel problem" during crawling. This paper introduces the principles and implementation details of the focused crawler comprehensively and improves Shark-Search in two respects, topic discrimination and crawler search strategy, in order to improve its performance. The main work is as follows:

1. The vector space model computes relevance by keyword matching alone and neglects the semantic information between words. To address this, this paper proposes a new topic relevance calculation model: word2vec and a topic model are used to construct topical word vectors that extend the semantics of words; the TF-IDF algorithm is improved using the semi-structured features of web pages and is used to extract page keywords; web pages and topics are represented as weighted averages of the corresponding keyword vectors, and topic relevance is calculated with cosine similarity. Experiments show that the proposed model outperforms the vector space model in topic discrimination of web pages.

2. The Shark-Search algorithm considers only the content attributes of a link and ignores its network structure attributes. Based on the content aggregation principle of the Internet, this paper proposes a link evaluation method based on URL clustering. The resulting network structure score, together with the original link content score of Shark-Search, constitutes the final link score. This not only solves the "myopia problem" of Shark-Search but also reduces the error rate caused by missing anchor text and prevents topic drift.

3. To better solve the tunnel problem and expand crawler coverage, this paper optimizes the tunnel-crossing mechanism of Shark-Search: hub pages are identified using the idea of the HITS algorithm, and different tunnel-crossing strategies are formulated for different types of web pages. Compared with Shark-Search, the optimized algorithm can stop irrelevant searches and improve the success rate of crossing long tunnels.

4. The proposed topic relevance model is introduced into Shark-Search to replace the original vector space model. Combined with the above improvements, a focused crawler based on the improved Shark-Search algorithm is proposed. Compared with other focused crawler algorithms, the algorithm in this paper is more than 5% higher in precision and harvest rate, which verifies its effectiveness.
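
To make contributions 1 and 2 more concrete, the following is a minimal Python sketch, not the thesis implementation. It assumes keyword weights (for example, TF-IDF scores) and word vectors are already available as plain dictionaries; the function names, the toy vectors, and the linear blend with parameter `alpha` in `link_score` are hypothetical, since the abstract does not state how the content score and the structure score are combined.

```python
import math


def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors (weights are e.g. TF-IDF scores)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_relevance(page_keywords, topic_keywords, embeddings):
    """Relevance of a page to a topic.

    page_keywords, topic_keywords: dict mapping keyword -> weight.
    embeddings: dict mapping keyword -> vector (assumed to come from a
    word-embedding model; here it is just a lookup table).
    """
    def represent(keywords):
        # Keep only keywords that have a vector, then average them by weight.
        known = [(embeddings[k], w) for k, w in keywords.items() if k in embeddings]
        if not known:
            return None
        vectors, weights = zip(*known)
        return weighted_average(vectors, weights)

    page_vec = represent(page_keywords)
    topic_vec = represent(topic_keywords)
    if page_vec is None or topic_vec is None:
        return 0.0
    return cosine_similarity(page_vec, topic_vec)


def link_score(content_score, structure_score, alpha=0.5):
    """Hypothetical linear blend of the two link scores (alpha is an assumption)."""
    return alpha * content_score + (1 - alpha) * structure_score


# Toy usage with made-up two-dimensional vectors.
embeddings = {"crawler": [1.0, 0.0], "spider": [0.9, 0.1], "recipe": [0.0, 1.0]}
topic = {"crawler": 1.0}
on_topic = topic_relevance({"spider": 1.0}, topic, embeddings)
off_topic = topic_relevance({"recipe": 1.0}, topic, embeddings)
print(on_topic > off_topic)
```

In this sketch a page whose keywords are semantically close to the topic ("spider" versus "crawler") scores higher than an unrelated page even though no keyword matches exactly, which is the limitation of pure keyword matching that the abstract describes.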
Keywords/Search Tags: focused crawler, topic discrimination, Shark-Search algorithm, link evaluation