Research On The Key Technology And Implementation Of The Focused Crawler Based On Shark-Search And OTIE Adaptive Algorithm

Posted on: 2020-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: H S Peng
Full Text: PDF
GTID: 2428330620453998
Subject: Computer technology
Abstract/Summary:
In the era of big data and explosive growth, Internet data has become the most important resource of the 21st century and the cornerstone of the development of artificial intelligence. How to obtain useful data from Internet resources accurately and quickly has become a hot research topic. A general search engine tries to query and return as much data as possible for the user, but most of the returned data is not what the user needs. A topic search engine improves the speed and accuracy of search by retrieving only information on the topic given by the user, and has thus become a main direction of search engine research. Web crawlers are an important part of search engines; they collect network resources from the Internet on the search engine's behalf. The three classic classes of algorithm in the topic crawler field are link-based algorithms, content-based algorithms, and algorithms that combine links with content. This thesis first addresses the shortcomings of the content-based Shark-Search algorithm, then solves problems in the OTIE adaptive algorithm, which combines links with content. Finally, a topic crawler system is implemented based on the two improved algorithms. The main research contents are as follows:

(1) Improvement of the content-based topic crawler algorithm Shark-Search. When calculating the topic relevance of sub-links, the Shark-Search algorithm is susceptible to insufficient sub-link context information and to noise links. An improved algorithm, ESS (Enhance Shark-Search), is proposed. First, the ESS algorithm no longer computes similarity between simple keywords and the sub-link context. Instead, it uses an iterative extension-filtering technique to expand the topic words into a more comprehensive set of topic-related words, which effectively reduces the impact of insufficient information. Second, the ESS algorithm eliminates noise links by introducing a pre-judgment weight. The pre-judgment weight is obtained by extracting features of the sub-link in the webpage, such as its CSS style, anchor text, and picture label, and calculating the weight corresponding to each feature. The introduction of weights and pre-judgment weights has a significant effect on reducing the impact of noise links. Experiments were carried out by crawling data on 4 different topics from Sina.com. The experimental results show that the precision of the ESS algorithm is 12.1% higher than that of the original algorithm, and the recall rate is 12.08% higher.

(2) Improvement of the OTIE adaptive algorithm, which combines links with content. The OTIE adaptive algorithm does not fully consider the balance between old and new web pages: because the cash value of web pages is poorly distributed during crawling, the crawler collects too few new web pages. An improved adaptive algorithm, E-OTIE, is proposed. The E-OTIE adaptive algorithm introduces a time-dependent weighting factor when determining the importance of a web page. The weighting factor is based on the time difference between the latest modification of the webpage and the time it was crawled. The larger the time difference, the older the webpage and the lower the corresponding weight. The introduction of time weights has a significant effect on balancing new and old web pages. Experiments with data crawled from the Internet show that the average harvest rate and average recall rate of the E-OTIE adaptive algorithm are close to those of the original algorithm, while its new web page harvest rate is increased by about 23%.

(3) Based on the above research, this thesis implements a prototype system of the topic crawler. The user simply configures it in the system interface according to their needs, and the system then grabs the data that meets the requirements.
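The two weighting ideas in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything specific here is an assumption made for illustration: the function names, the numeric feature weights, the weighted-sum combination for the pre-judgment weight, and the exponential half-life decay for the time weight. It should not be read as the thesis's implementation of ESS or E-OTIE.

```python
from datetime import datetime, timedelta

# Hypothetical feature weights. The abstract names the features (CSS style,
# anchor text, picture label) but does not state their values.
FEATURE_WEIGHTS = {"css_style": 0.3, "anchor_text": 0.5, "picture_label": 0.2}


def prejudgment_weight(feature_scores, weights=FEATURE_WEIGHTS):
    """Combine per-feature scores (each in [0, 1]) for one sub-link.

    A weighted sum is an assumption; missing features count as 0.
    """
    return sum(weights[name] * feature_scores.get(name, 0.0) for name in weights)


def time_weight(last_modified, crawled_at, half_life_days=30.0):
    """Return a weight in (0, 1] that shrinks as the page gets older.

    The age is the difference between crawl time and last modification.
    Exponential decay with a half-life is an assumed shape; the abstract
    only says that a larger time difference gives a lower weight.
    """
    age_days = max((crawled_at - last_modified).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


if __name__ == "__main__":
    now = datetime(2020, 11, 23)
    fresh = time_weight(now - timedelta(days=1), now)
    stale = time_weight(now - timedelta(days=365), now)
    print(f"fresh page weight: {fresh:.3f}, stale page weight: {stale:.3f}")
    print("link weight:", prejudgment_weight({"anchor_text": 1.0, "css_style": 0.5}))
```

In a crawler, a score like `time_weight` could be multiplied into a page's importance estimate so that recently modified pages are favoured, and links whose `prejudgment_weight` falls below some threshold could be dropped as noise before relevance is computed. Both uses are again assumptions about how such weights might be applied.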
Keywords/Search Tags:Theme crawler, Shark-Search, OTIE, data acquisition, adaptive