Research On Topic Search And Its Key Algorithm

Posted on: 2019-06-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Lv
Full Text: PDF
GTID: 2428330572951737
Subject: Engineering
Abstract/Summary:
With the rapid development of Internet technologies, the resources available online have grown explosively, and quickly and accurately finding the information a user needs in this volume of data has become increasingly difficult. General-purpose search engines are widely used, but they struggle to serve users who need precise results in a specialized domain; such users need a vertical search engine. The topic web crawler is the core of a vertical search engine: it crawls only pages that are highly relevant to the topic. Existing topic-crawler algorithms nevertheless have shortcomings, such as topic drift and a lack of global consideration. The research in this thesis grew out of the development of a military information retrieval system for a research institute and focuses on topic-crawler search strategies. For military-related pages, it weighs the advantages and disadvantages of different algorithms and improves the PageRank algorithm so that the improved algorithm performs better when crawling pages. The main research content of this thesis is as follows:

First, the thesis reviews the theories and technologies relevant to web crawlers. This part analyzes the differences between general crawlers and topic crawlers, and then the technologies used to implement a topic crawler, chiefly page processing and relevance calculation.

Second, with the goal of crawling military-themed pages, the thesis analyzes the PageRank algorithm and finds that it tends to neglect new pages and to suffer from topic drift, especially on time-sensitive web pages. To address the neglect of new pages, it proposes introducing a time factor into the algorithm when crawling military-related pages. With the time factor, the algorithm lowers the value computed for old pages when calculating PageRank, which removes the bias between new and old pages. To address topic drift, it proposes a further improvement: because the Shark-Search algorithm takes page relevance into account when guiding the crawler, the improved PageRank algorithm is combined with Shark-Search. The combined algorithm can eliminate topic drift by fetching topic-relevant pages.

Finally, the effectiveness of the improved algorithm is evaluated experimentally using precision and recall. The total number of topic-relevant pages on the Internet is hard to obtain, but it is a constant, so recall is proportional to the number of relevant pages crawled; this thesis therefore uses the number of topic-relevant pages crawled in place of the recall rate. The experiments show that the improved algorithm performs well in both precision and recall on military-related pages. The algorithm is then applied to the information retrieval system. Compared with the Baidu index, the indexed content of the information retrieval system shows stronger topic relevance.
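
The abstract does not give the formula behind the time factor, so the following is only a minimal sketch of the idea of lowering the PageRank value of old pages. It assumes an exponential half-life decay applied to standard PageRank scores after the iteration finishes. The function names, the `half_life_days` parameter, and the four-page link graph are hypothetical illustrations and are not taken from the thesis.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Standard iterative PageRank over a small link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


def time_adjusted_rank(rank: Dict[str, float],
                       age_days: Dict[str, float],
                       half_life_days: float = 30.0) -> Dict[str, float]:
    """Assumed time factor: halve a page's score every `half_life_days`."""
    adjusted = {
        p: r * 0.5 ** (age_days.get(p, 0.0) / half_life_days)
        for p, r in rank.items()
    }
    total = sum(adjusted.values())
    # Renormalize so the adjusted scores again sum to 1.
    return {p: v / total for p, v in adjusted.items()}


# Hypothetical graph: "old" has more inlinks than "new" but is a year old.
links = {
    "a": ["old", "new"],
    "b": ["old"],
    "old": ["a"],
    "new": ["a", "b"],
}
age_days = {"a": 10.0, "b": 10.0, "old": 365.0, "new": 1.0}

plain = pagerank(links)
adjusted = time_adjusted_rank(plain, age_days)
print(sorted(plain, key=plain.get, reverse=True))
print(sorted(adjusted, key=adjusted.get, reverse=True))
```

In this toy graph the plain scores favor the well-linked old page, while the time-adjusted scores favor the recent one. A crawler that orders its frontier by the adjusted score would therefore visit new pages earlier. The decay shape and half-life are design choices that would need tuning against real crawl data.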
Keywords/Search Tags: Topic Crawler, Crawling Strategy, Shark-Search Algorithm, PageRank Algorithm, Time Factor