
Topic Crawler Based On Improved VIPS Algorithm And Improved Grey Wolf Optimization Algorithm

Posted on: 2020-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: J J Xiao
Full Text: PDF
GTID: 2428330596968143
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, the number of web pages is growing explosively, and giving users accurate access to the pages they need is a pressing problem for search engines. Vertical search engines for specific topics emerged in response, and their core component is the topic crawler. This paper introduces the concept of web page segmentation and combines it with the link evaluation method of a topic crawler to propose a more effective topic crawler. The main work of this paper is as follows:
(1) A new web page segmentation algorithm is proposed. The segmentation rules of the VIPS algorithm are further optimized for the existing page design structure based on "DIV+CSS". According to the needs of topic crawlers, text blocks and link blocks are extracted, and irrelevant links and spam information in web pages are filtered out. Feature keywords are then extracted from the text in the topic block and weighted with an improved TF-IDF weighting algorithm, and topic relevance is calculated using the vector space model. This content analysis based on web page segmentation supplies higher-quality web page URLs for subsequent link evaluation while reducing the impact of irrelevant content.
(2) A topic crawler must calculate the priority of web links to determine its crawling direction. Drawing on the basic idea of swarm intelligence algorithms, this paper introduces the grey wolf optimization algorithm. By adding the concept of dynamic weights and changing how the convergence factor is calculated, an improved grey wolf optimization algorithm is obtained and applied to the topic crawler. Link priorities are computed more accurately, the crawler avoids falling into a "local optimum", its global search ability improves, and unrelated links can be discarded, which improves the quality of the returned pages. Experiments show that the improved grey wolf optimization algorithm can significantly improve the accuracy of the crawler.
(3) The two methods are combined to design a topic crawler system. After data preparation and parameter setting, the topic crawler system used in this paper is compared with a topic crawler system based on the PageRank algorithm and one built on the shark-search algorithm. The coverage, accuracy and sum of information of the three topic crawlers are analyzed and compared in detail. The results show that the topic crawler system proposed in this paper performs better.
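The relevance calculation in item (1) can be illustrated with a minimal sketch. The abstract does not say how the TF-IDF weighting was improved, so the code below uses the textbook smoothed TF-IDF and cosine similarity in a vector space model as a stand-in. The function names (`tokenize`, `tf_idf_vector`, `cosine`, `topic_relevance`) and the whitespace tokenizer are assumptions made for illustration, not the thesis's implementation.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; a real crawler
    would use proper word segmentation and stop-word filtering)."""
    tokens = []
    for raw in text.split():
        word = raw.strip(PUNCT).lower()
        if word:
            tokens.append(word)
    return tokens

def tf_idf_vector(tokens, doc_freq, n_docs):
    """Textbook TF-IDF with smoothed IDF, NOT the thesis's improved variant:
    weight = tf * (log((1 + N) / (1 + df)) + 1)."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {
        term: (count / total)
        * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, count in counts.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def topic_relevance(block_text, topic_text, corpus):
    """Score a page block against a topic description in the vector space model.
    `corpus` is a list of reference documents used only for document frequencies."""
    docs = [tokenize(doc) for doc in corpus]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    block_vec = tf_idf_vector(tokenize(block_text), doc_freq, n_docs)
    topic_vec = tf_idf_vector(tokenize(topic_text), doc_freq, n_docs)
    return cosine(block_vec, topic_vec)
```

In a crawler built along these lines, blocks whose score falls below a chosen threshold would be dropped before their links are passed on to link evaluation. The threshold itself is a tuning parameter that the abstract does not specify.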
Keywords/Search Tags: Topic Crawler, Keyword Extraction, Web Page Segmentation, Topic Relevance, Grey Wolf Optimization Algorithm