
Research On The Key Technology And Implementation Of The Focused Crawler Based On HITS And Shark-Search

Posted on: 2019-08-03  Degree: Master  Type: Thesis
Country: China  Candidate: L G Liu  Full Text: PDF
GTID: 2428330566968736  Subject: Software engineering
Abstract/Summary:
We live in the era of big data, and the large amount of data on the Internet offers considerable opportunities for scientific research and product development. How to obtain the required data quickly and accurately from this large volume of Internet resources has long been a hot topic in network research. Traditional general-purpose search engines provide only a coarse query service and can no longer meet users' increasingly specialized and personalized search requirements. A topic search engine retrieves only Internet resources related to a specific topic and can therefore provide more accurate and faster search services, which has made it an important direction in current search engine development. Web crawlers are the core components of search engines: a search engine uses crawlers to automatically fetch data from the Internet and then indexes the acquired data so that users can query it. Algorithms for topic (focused) crawlers fall into two types: link-based and content-based. This thesis first optimizes the link-based HITS algorithm, then improves the content-based Shark-Search algorithm, and finally implements a crawler prototype system based on the two improved algorithms so that people who do not understand crawlers can obtain network data. The main work of this thesis is summarized as follows:

(1) Improvement of the link-based focused crawler algorithm HITS. The improved algorithm, IHITS, is proposed to solve the problem that HITS pays more attention to old web pages and ignores the importance of new ones. IHITS introduces a time-dependent weight P(t) and a website authority function W(i) when judging the importance of a page. The difference between the time at which a web page is searched and the time at which it was last modified has a significant impact on the value of its content, so P(t) assigns different weights to old and new web pages. W(i) distinguishes authoritative websites from ordinary ones based on the fact that authoritative websites are cited frequently and ordinary websites are cited less frequently. Experiments on network data from a real Internet environment show that IHITS balances the importance of old and new web pages, distinguishes better between web pages of different relevance, and improves the accuracy of web page ranking.

(2) Improvement of the content-based focused crawler algorithm Shark-Search. Shark-Search is susceptible to noise links when calculating topic relevance, and the link context it considers does not adequately determine topic relevance. To solve this problem, this thesis presents an improved algorithm, ISS. ISS no longer considers the link context, which is easily disturbed, and instead considers two factors that are more representative of topic relevance: web page structure and page title. Experiments on the ODP open directory, which is frequently used in the crawler field, show that ISS reduces the influence of noise links on relevance calculation and improves the crawler's resistance to interference.

(3) Implementation of a crawler prototype system. Many data analysts do not understand web crawlers and lack data sources. Based on the two improved algorithms above, this thesis implements a crawler prototype system. The system does not require the user to understand how crawlers work; after simple configuration, it automatically collects the desired data from the Internet for the user.
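To make the idea in item (1) more concrete, the following Python sketch shows one way a per-page weight could be folded into the standard HITS iteration. It is only an illustration under stated assumptions: the abstract does not give the actual form of P(t) or W(i), so the exponential decay in `time_weight`, its `half_life_days` parameter, and the choice to multiply the weight into the authority update are all hypothetical and are not the thesis's formulas. A caller could, for example, pass `page_weight[p] = P(t) * W(i)` for each page.

```python
import math


def time_weight(age_days, half_life_days=180.0):
    """Hypothetical stand-in for P(t): halves every `half_life_days` days
    since the page was last modified. The real P(t) is not given in the abstract."""
    return 0.5 ** (age_days / half_life_days)


def _normalize(scores):
    """Scale a score dict to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def weighted_hits(links, page_weight, iterations=50):
    """Standard HITS iteration with an assumed per-page multiplier.

    links:       dict mapping a page to the list of pages it links to
    page_weight: dict mapping a page to a multiplier (missing pages use 1.0);
                 an assumed placeholder for something like P(t) * W(i)
    Returns (authority, hub) score dicts.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    incoming = {page: [] for page in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    auth = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of in-linking pages, scaled by the weight.
        auth = _normalize({
            page: page_weight.get(page, 1.0) * sum(hub[src] for src in incoming[page])
            for page in pages
        })
        # Hub: sum of authority scores of the pages this page links to.
        hub = _normalize({
            page: sum(auth[dst] for dst in links.get(page, []))
            for page in pages
        })
    return auth, hub
```

With all weights equal to 1.0 this reduces to plain HITS, which makes it easy to compare rankings with and without the weighting. If one hub page links to both a recently modified page and a stale one, the sketch gives the recent page the higher authority score, which is the kind of rebalancing between old and new pages the abstract describes.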
Keywords/Search Tags: Web crawler, topic crawling, Shark-Search, HITS, data acquisition