Research And Realization Of Topical Crawler Based On Content And Hyperlink

Posted on: 2017-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: P D Wang
GTID: 2308330488951950
Subject: IC Engineering
Abstract/Summary:
With the rapid development of the Internet, the Web has become a vast global information repository, and how to give users fast access to professional, personalized information on the Web has become a hot research issue. Conventional general-purpose search engines return excessive results that are only weakly related to the topic. Vertical search engines, which target a particular topic, were proposed in response; they can provide users with more accurate and professional information services.

The topical crawler is the core of a vertical search engine and the main subject of this thesis. Its defining feature is that it retrieves related resources from the Web based on manually preset topics and keywords. Traditional topical crawlers predict the importance of hyperlinks using either content evaluation or link structure alone. A crawling strategy based on content evaluation considers only part of the text information, so its judgment of topical relevance is one-sided and ignores the influence of link structure. A crawling strategy based on link structure calculates page importance from a global perspective, but because it does not consider topical relevance during crawling, it often suffers from "topic drift".

To solve these problems, this thesis proposes a topical crawler based on both content and hyperlinks. The details are as follows:

(1) An improved Naive Bayesian model is constructed to extract keywords from the pages the topical crawler fetches. Mutual information is used to measure the correlation between attributes, and between each attribute and the class attribute. Five attributes (word frequency, word location, word span, word length, and part of speech) are then clustered accordingly to build the improved model, so that the "independence assumption" between attributes is better satisfied. The improved algorithm achieves higher keyword extraction accuracy.

(2) A Web information extraction method based on the density distribution of text is proposed. The method does not rely heavily on page tags and extracts content well. It performs structured extraction on the crawled pages, extracting title and paragraph information, and so provides complete page text for the subsequent keyword extraction.

(3) A vector space model is built from the preset keywords and the current page's topic keywords. Page topical relevance is calculated with cosine similarity and introduced as a parameter into the original PageRank formula, improving the PageRank algorithm so that it calculates the importance of the current page more precisely by taking content into account.

(4) The access priority of each hyperlink is determined by its importance. Hyperlinks are then placed in either a high-priority or a low-priority URL queue, and irrelevant hyperlinks are discarded. This ensures that the topical crawler always crawls high-quality hyperlinks first, which improves crawling efficiency and accuracy.

Finally, the topical crawler is implemented in a real environment and the results are analyzed. To verify its performance from multiple perspectives, the improved Naive Bayesian algorithm is first tested on keyword extraction; the results show that it achieves high extraction accuracy. On this basis, the overall system performance of the proposed topical crawler is compared with crawlers based on the PageRank algorithm and the Shark-Search algorithm. The experimental results show that the proposed topical crawler achieves higher precision and a greater amount of information, outperforming the existing methods.
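
Steps (3) and (4) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the modified PageRank formula or the queue thresholds, so the way relevance scales each page's outgoing rank, the function names, and the `high`/`low` cut-offs are all assumptions made for illustration.

```python
import math
from collections import deque


def cosine_relevance(topic_vec, page_vec):
    """Cosine similarity between two sparse keyword-weight dicts."""
    dot = sum(w * page_vec.get(term, 0.0) for term, w in topic_vec.items())
    norm_t = math.sqrt(sum(w * w for w in topic_vec.values()))
    norm_p = math.sqrt(sum(w * w for w in page_vec.values()))
    if norm_t == 0.0 or norm_p == 0.0:
        return 0.0  # an empty vector has no measurable relevance
    return dot / (norm_t * norm_p)


def topical_pagerank(links, relevance, damping=0.85, iterations=50):
    """Iterative PageRank where a page's outgoing rank is scaled by its relevance.

    ASSUMPTION: scaling the contribution of each source page by
    relevance[source] is one plausible way to feed topical relevance into
    PageRank; the abstract does not specify the actual formula.
    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * relevance.get(source, 0.0) * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def enqueue_by_priority(scores, high, low):
    """Split URLs into high- and low-priority queues, best score first.

    URLs scoring below `low` are discarded as irrelevant.
    ASSUMPTION: `high` and `low` are hypothetical thresholds.
    """
    high_q, low_q = deque(), deque()
    for url, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if score >= high:
            high_q.append(url)
        elif score >= low:
            low_q.append(url)
    return high_q, low_q
```

A crawler built this way would compute `cosine_relevance` for each fetched page, pass the resulting scores to `topical_pagerank` together with the link graph discovered so far, and then call `enqueue_by_priority` on the importance scores, draining the high-priority queue before the low-priority one.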
Keywords/Search Tags: Topical Crawler, Naive Bayes, Topical Relevance, PageRank