
Research And Implementation Of Topic Crawler System Based On Query Expansion And Multi-Objective Optimization

Posted on: 2021-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: C J Liu
GTID: 2518306308970179
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of the Internet and the continued accumulation of online information, traditional web crawlers can no longer meet people's needs for personalized, real-time access to information, and topic crawlers have emerged in response. Compared with traditional web crawlers, topic crawlers take a clear topic description as the crawling target and use intelligent link evaluation to optimize the crawling path, thereby achieving higher efficiency. However, current topic description methods struggle to balance construction cost against completeness, and current topic crawling strategies struggle to coordinate the multiple factors that affect link priority. To address these problems, this thesis proposes a topic crawler system based on query expansion and multi-objective optimization.

This thesis uses query expansion to enhance the completeness of the original topic description, and uses an improved TextRank algorithm to extract topic keywords from the iterative query results to enrich the topic model. First, based on the pre-trained word vectors of the BERT model, a topic relevance factor is introduced into the transition weight matrix of the TextRank algorithm, yielding the Topic-TextRank algorithm, which improves topic keyword extraction. Then, combined with the iterative process of relevance feedback and pseudo-relevance feedback, the keyword weights produced by the Topic-TextRank algorithm are dynamically fused with the query ranking. Finally, based on the dynamic Topic-TextRank algorithm, an expansion framework using relevance description and an expansion framework using pseudo-relevance feedback are proposed, and experiments verify that both frameworks improve the topic description.

This thesis abstracts the topic crawling process as a multi-objective optimization problem, with the factors that determine link priority serving as the objective functions, and solves it with an improved Ant Colony Algorithm and an improved NSGA-II algorithm. For the Ant Colony Algorithm, the pheromone is divided into gain pheromone and penalty pheromone according to the relevance of the web page. Based on these two kinds of pheromone and on the influence that points on the ant colony's paths have on the forward multi-segment paths, a backtracking-update algorithm for the pheromone is proposed. For the NSGA-II algorithm, a weighted calculation of the crowding distance is introduced to optimize the final elite selection. Finally, the two improved algorithms are combined into a topic crawling strategy based on multi-objective optimization, and experiments verify that it improves the accuracy and efficiency of topic crawlers.

This thesis develops and implements a topic crawler system based on query expansion and multi-objective optimization to achieve accurate, comprehensive, and efficient crawling of target topics. The system consists of a topic description module, a topic crawling module, and a data storage module. The topic description module uses query expansion to obtain the topic model and the seed web pages. The topic crawling module carries out the crawling process based on multi-objective optimization. The data storage module uses Redis and MySQL to store the intermediate data and the resulting web pages.
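
The weighted crowding distance mentioned for the NSGA-II elite selection can be illustrated with a minimal sketch. The abstract does not give the thesis's actual weighting formula, so the version below is an assumption: each objective's normalized neighbor gap is multiplied by a per-objective weight before being summed. The function name `weighted_crowding_distance` and the `weights` parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def weighted_crowding_distance(
    front: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Crowding distance for one non-dominated front.

    front   -- objective vectors, one per candidate link (hypothetical input)
    weights -- one weight per objective (assumed form of the weighting)
    """
    n = len(front)
    if n == 0:
        return []
    distances = [0.0] * n
    for m, w in enumerate(weights):
        # Sort candidate indices by their value on objective m.
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        # Boundary solutions are always kept, as in standard NSGA-II.
        distances[order[0]] = distances[order[-1]] = float("inf")
        if hi == lo:
            continue  # objective m does not separate the candidates
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            # Assumption: the weight scales the normalized gap per objective.
            distances[order[k]] += w * gap / (hi - lo)
    return distances


if __name__ == "__main__":
    front = [(0.0, 1.0), (0.25, 0.5), (0.5, 0.25), (1.0, 0.0)]
    print(weighted_crowding_distance(front, weights=(2.0, 1.0)))
```

With all weights set to 1 this reduces to the standard NSGA-II crowding distance; larger weights make spread along the corresponding objective count more when ties within a front are broken.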
Keywords/Search Tags: query expansion, topic description, multi-objective optimization, topic crawler