
The Design And Implementation Of The Topic-focused Web Crawler System

Posted on: 2020-11-29, Degree: Master, Type: Thesis
Country: China, Candidate: K Zhao, Full Text: PDF
GTID: 2428330572473695, Subject: Electronics and Communications Engineering
Abstract/Summary:
With the popularization of the Internet and the growth of network scale, the amount of web data far exceeds what general-purpose search engines can cover. Topic-focused (theme) web crawlers emerged to improve the quality of the resources that are collected: during crawling, they preferentially visit pages with high similarity to the target topic. Traditional topic crawling strategies are usually based on analyzing either the text content of web pages or the link structure between them. Content-based strategies consider only the text of a page. Link-structure-based strategies can predict the topic of a page from the links among multiple pages, but because they lack textual evidence for judging topic relevance, they often cause the crawler to drift away from the topic ("theme deviation"). Although hybrid crawling strategies have been studied and topic judgment and crawling algorithms have been improved, there is still room for improvement in recall, precision and crawling speed.

To improve the topic determination performance of the topic crawler, this thesis proposes a text topic similarity algorithm based on HowNet. First, a topic similarity evaluation method based on HowNet is proposed. To address the inaccuracy of the traditional information content (IC) evaluation method, the traditional IC calculation model is improved by adding the handling of hypernyms, synonyms and polysemous words to the IC calculation. Second, to address the excessively high vector dimension of the vector space model (VSM), a text feature dimensionality reduction method based on HowNet is proposed: after the text is preprocessed with the term frequency-inverse document frequency (tf-idf) algorithm, the text vector is further reduced using HowNet. Finally, an improved hybrid topic similarity algorithm is designed that combines the text content similarity of web pages with the link structure of web pages. It computes text similarity with the HowNet-based evaluation method described above and combines that similarity with the PageRank algorithm to calculate the PageRank value of each page. Simulation results show that the algorithm improves the accuracy of topic similarity judgment and keeps crawled pages from deviating from the predetermined topic.

Based on the proposed hybrid topic determination algorithm, a topic crawler system is designed and implemented. The functional requirements of the system are analyzed in detail. The crawler function is built on the WebCollector framework, and Neo4j and MySQL are used to persist the text and pages related to the topic. The system consists mainly of a web page parsing module, a text processing module, a topic strategy module and a topic comparison text module. The web page parsing module extracts the text content of a page. The text processing module preprocesses that text and converts it into a feature vector. The topic strategy module judges the topic similarity of the page. The topic comparison text module provides the reference text needed for the similarity comparison.

Test results show that the system can effectively determine the crawling topic from keywords and obtain highly relevant reference text through the comparison text module. Starting URL scheduling from seed links, it crawls and stores pages related to the topic while avoiding large numbers of irrelevant pages. In terms of performance, the system performs well in time cost, concurrency and compatibility.
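
As a rough illustration of the hybrid idea summarized above (text similarity combined with PageRank), the following Python sketch may help. It is not the thesis's algorithm: the abstract gives no formulas, so plain tf-idf cosine similarity stands in for the HowNet-based similarity, and the weighting rule (a page passes its rank to its out-links in proportion to their topic relevance) is an assumption. The function names `tf_idf_vectors`, `cosine` and `topic_biased_pagerank`, the toy pages and the damping factor 0.85 are all hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn a list of token lists into sparse tf-idf vectors (dict: term -> weight)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency of each term
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        # Smoothed idf so that terms present in every document keep a non-zero weight.
        vectors.append({t: (c / total) * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_biased_pagerank(links, relevance, damping=0.85, iterations=50):
    """PageRank in which each page splits its rank among its out-links in
    proportion to their topic relevance (an assumed weighting, not the thesis's)."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = links.get(u, [])
            weight_sum = sum(relevance.get(v, 0.0) for v in targets)
            if not targets or weight_sum == 0.0:
                # No out-links, or none relevant: spread this page's rank uniformly.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
                continue
            for v in targets:
                new_rank[v] += damping * rank[u] * relevance.get(v, 0.0) / weight_sum
        rank = new_rank
    return rank


# Toy data: a topic description, three tokenized pages and their links.
topic = ["topic", "crawler", "similarity"]
pages = {
    "seed": ["crawler", "web", "page", "link"],
    "on_topic": ["crawler", "topic", "similarity", "web"],
    "off_topic": ["football", "match", "score", "goal"],
}
links = {"seed": ["on_topic", "off_topic"], "on_topic": ["seed"], "off_topic": ["seed"]}

vectors = tf_idf_vectors([topic] + list(pages.values()))
relevance = {name: cosine(vectors[0], vec) for name, vec in zip(pages, vectors[1:])}
rank = topic_biased_pagerank(links, relevance)
for name in pages:
    print(name, round(relevance[name], 3), round(rank[name], 3))
```

In this sketch a page that shares no terms with the topic receives only the baseline rank, which is one simple way a crawler could deprioritize links that lead away from the topic. A HowNet-based measure would replace `cosine` with a semantic similarity that also credits related words.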
Keywords/Search Tags: topic-focused crawler, HowNet, theme similarity, PageRank