Research on Search Strategy and Key Techniques of Focused Crawler Based on Bionics

Posted on: 2020-03-31  Degree: Master  Type: Thesis
Country: China  Candidate: P Jiang  Full Text: PDF
GTID: 2370330590971969  Subject: Software engineering
Abstract/Summary:
The focused crawler is a key component of a topic search engine. Its purpose is to retrieve as many pages as possible that are related to a designated topic. It filters web pages using relevance algorithms or specific strategies until a set number of downloads, a set number of iterations, or a topic-similarity accuracy threshold is reached. Compared with a general crawler, a focused crawler must solve three problems: defining the topic, analyzing web page data, and choosing a search strategy for unknown URLs. Research on the first two has become increasingly accurate and comprehensive. The search strategy for unknown URLs remains both a hot topic and a difficulty in focused crawler research. Work in this area has progressed from early content-based and link-based approaches to the use of thesauri and ontologies, and most current studies are based on machine learning algorithms. These advances have improved the search accuracy and coverage of focused crawlers. However, current search strategies still suffer from inaccurate topic relevance calculation, low coverage of crawled web pages, a tendency toward topic drift, and poorly chosen seed pages. To address these problems, this thesis studies the search strategy of the focused crawler and related techniques:

1. This thesis proposes a focused crawler framework based on particle swarm optimization improved with mutation. First, for each topic, three different types of seed pages are obtained. Then the weights of the three seed pages are calculated for each topic and used as the initial velocity and direction values of the crawler. The improved algorithm sets the global extremum to an ideal value that does not actually exist, so the influence of the global extremum is neglected, and a mutation step is applied when the algorithm falls into local convergence. Finally, the experimental results indicate that the focused crawler obtains more accurate URL priorities and crawls higher-quality web pages than the other crawlers it was compared with. The proposed focused crawler framework is therefore effective.

2. This thesis builds a seed page selection framework based on a community detection algorithm. First, a certain number of initial topic-related seed pages are obtained from a search engine. These pages are treated as nodes and partitioned into communities by an improved Louvain algorithm. The node size is determined by calculating the normalized mutual information of the initially partitioned communities, and duplicate nodes are deleted to construct a supernode network. Finally, the weight of each supernode page is obtained by calculating the similarity between the content of the node page and the topic. After duplicates are removed, pages whose weights are larger than the threshold form the seed set. Experimental analysis shows that the proposed framework improves the performance of the focused web crawler in terms of both accuracy and coverage.
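The final step of the seed selection framework (weight each candidate page by its similarity to the topic, remove duplicates, keep pages above a threshold) can be illustrated with a minimal Python sketch. The abstract does not state which similarity measure or threshold the thesis uses, so cosine similarity over term counts, the function names, and the default threshold of 0.3 are all assumptions made for illustration. The improved Louvain partitioning and the supernode construction are not reproduced here.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (an assumed, simplified page representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_seeds(pages, topic, threshold=0.3):
    """Return (url, weight) pairs whose weight exceeds the threshold.

    pages: iterable of (url, text) pairs; repeated URLs are skipped.
    threshold: hypothetical cut-off, not a value taken from the thesis.
    """
    topic_vec = term_vector(topic)
    seen = set()
    seeds = []
    for url, text in pages:
        if url in seen:  # remove repetitions
            continue
        seen.add(url)
        weight = cosine_similarity(term_vector(text), topic_vec)
        if weight > threshold:
            seeds.append((url, weight))
    # Highest-weight pages first
    return sorted(seeds, key=lambda s: s[1], reverse=True)
```

In this sketch a page is identified by its URL alone; a real crawler would likely normalize URLs or compare page content before treating two pages as duplicates.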
Keywords/Search Tags:mutation idea, particle swarm algorithm, focused crawler, louvain algorithm, seed selection strategy
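For readers unfamiliar with the first two keywords, the following is a minimal Python sketch of a particle swarm update with a mutation step that is triggered when the swarm stops improving. It is a generic illustration on a toy objective, not the thesis's algorithm: the abstract gives no formulas, so the standard velocity update with a global-best term, the stagnation counter `patience`, the Gaussian mutation, and all parameter values are assumptions. In particular, the thesis's replacement of the global extremum with an ideal value, and its use of seed page weights as initial velocities, are not reproduced.

```python
import random


def pso_with_mutation(objective, dim, n_particles=20, iters=200,
                      w=0.7, c1=1.5, c2=1.5, patience=10,
                      mutation_scale=1.0, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with PSO; mutate half the swarm after stagnation."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    stall = 0  # iterations since the global best last improved

    for _ in range(iters):
        improved = False
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Standard update: inertia + personal pull + global pull
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
                    improved = True
        stall = 0 if improved else stall + 1
        if stall >= patience:
            # Assumed mutation rule: perturb a random half of the particles
            # so the swarm can leave a local optimum; the bests are kept.
            for i in rng.sample(range(n_particles), n_particles // 2):
                pos[i] = [min(hi, max(lo, x + rng.gauss(0.0, mutation_scale)))
                          for x in pos[i]]
                vel[i] = [0.0] * dim
            stall = 0
    return gbest, gbest_val
```

A call such as `pso_with_mutation(lambda p: sum(x * x for x in p), dim=2)` returns the best position found and its objective value. In a crawler, the objective would instead score candidate URLs, which is beyond what the abstract specifies.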