Focused Crawler Based On Incremental Bayes Algorithm

Posted on: 2019-08-28    Degree: Master    Type: Thesis
Country: China    Candidate: S S Wang    Full Text: PDF
GTID: 2428330545485302    Subject: Computer Science and Technology
Abstract/Summary:
Traditional web crawlers can collect huge amounts of data from the Internet. However, the information on the World Wide Web is growing explosively, so a large part of the acquired data is useless to users. How to return more useful data has therefore attracted the attention of researchers. A focused crawler returns more relevant data by predicting the correlation between links and topics. The calculation of topic relevance and the prediction of link priority are the main factors that influence the performance of a focused crawler. This paper improves the performance of the focused crawler by improving the formula for link priority calculation and by introducing an incremental Bayes algorithm to calculate the correlation between text and topics. The main work of this paper is as follows:

1) This paper applies the idea of incremental learning to the Naive Bayes (NB) classification algorithm; the resulting algorithm is called ILNB. When training the Bayes model, half of the training data is used to generate the initial model, and the initial model is then improved according to its prediction results on the remaining training data. The experimental results of ILNB and NB are compared and show the effectiveness of the ILNB algorithm.

2) This paper proposes a new method of computing link priority: the URL text is used together with the anchor text and the page text to determine the priority of a URL. Experiments with different parameters and with the use of URL text show that introducing URL text into the calculation of link correlation helps to improve the performance of the focused crawler.

3) The calculation of topical correlation is modeled as a classification problem, and the ILNB algorithm is used to determine the topical relevance of the anchor text and the parent web page text. Experiments on different topics are carried out on the NetEase news website, and the results show that the ILNB algorithm can improve the performance of the focused crawler.
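
The two-stage training described for ILNB (an initial model built from half of the training data, then refined using its predictions on the other half) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the update rule, so the sketch assumes an error-driven rule in which a held-back sample is folded into the model only when the current model misclassifies it. The names `IncrementalNB`, `train_ilnb`, and `toy_samples`, the multinomial model, and the Laplace smoothing constant `alpha` are all hypothetical choices made for the example.

```python
import math
from collections import Counter, defaultdict


class IncrementalNB:
    """Multinomial Naive Bayes whose counts can be updated one sample at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # Laplace smoothing (assumed)
        self.class_docs = Counter()                # documents seen per class
        self.token_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()                         # all tokens seen so far

    def update(self, tokens, label):
        """Fold one labelled sample into the model's counts."""
        self.class_docs[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        """Return the class with the highest log-posterior for the tokens."""
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_ilnb(samples, alpha=1.0):
    """Build the initial model from the first half of `samples`, then refine it.

    Assumed refinement rule: each remaining sample is predicted first and is
    added to the model only if that prediction is wrong.
    Returns the model and the number of corrective updates applied.
    """
    half = len(samples) // 2
    model = IncrementalNB(alpha)
    for tokens, label in samples[:half]:
        model.update(tokens, label)

    corrections = 0
    for tokens, label in samples[half:]:
        if model.predict(tokens) != label:
            model.update(tokens, label)
            corrections += 1
    return model, corrections


# Toy data, invented purely for illustration (token lists with topic labels).
toy_samples = [
    (["football", "match", "goal"], "sports"),
    (["stock", "market", "shares"], "finance"),
    (["team", "goal", "league"], "sports"),
    (["bank", "profit", "shares"], "finance"),
    (["coach", "team", "match"], "sports"),
    (["bank", "loan", "rate"], "finance"),
]

model, corrections = train_ilnb(toy_samples)
print(model.predict(["goal", "team"]), corrections)
```

In a focused crawler, a classifier of this kind would be applied to anchor text or page text to decide whether a link is on topic. Because the model stores only counts, further labelled pages can be folded in later with `update` without retraining from scratch.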
Keywords/Search Tags: Focused Crawler, Incremental Learning, Naive Bayes