Design And Implementation Of Topic-specific Web Crawler

Posted on: 2010-12-14 | Degree: Master | Type: Thesis
Country: China | Candidate: X L Liu | Full Text: PDF
GTID: 2178360275982472 | Subject: Software engineering
Abstract/Summary:
With the growth of the Internet, the web has become the carrier of a vast amount of information, and gathering and using that information effectively has become a major challenge. Traditional search engines cannot keep up with the increasingly demanding and varied search requirements of different users. Topic-driven search engines have emerged in response: they are better classified, contain deeper and more focused data, and are updated in a timely manner. This thesis studies algorithms for hypertext classification and search strategies for focused crawlers.

The thesis first introduces the general development of focused crawlers and the techniques they rely on. It then analyzes their core techniques, concentrating on hypertext classification and search strategy, which provide the theoretical basis for IL-Crawler (Incremental Learning Crawler), the crawler developed in this work.

For hypertext classification, Chinese word segmentation and the vector model incur a heavy computational cost and are inefficient. To address this, the thesis proposes a web page recognition algorithm based on incremental learning. Features drawn from the HTML, the URL, and the page text are extracted to construct feature values, and a machine learning algorithm builds a decision tree model from them to recognize web pages, which avoids Chinese word segmentation. If the recognition precision falls below a predefined threshold, the algorithm adds the features of the incorrectly recognized pages to update the decision tree model and improve its precision.
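
The incremental recognition loop described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the three binary features in `extract_features`, the Gini-based tree builder, the 0.9 threshold, and the choice to rebuild the tree from the enlarged training set are all assumptions made for the example, and "recognition precision" is taken here to mean the fraction of pages recognized correctly.

```python
import re
from collections import Counter


def extract_features(url, html):
    """Hypothetical binary features taken from the URL, the HTML, and the page text."""
    text = re.sub(r"<[^>]+>", " ", html).lower()  # crude tag stripping
    return (
        int("blog" in url.lower()),                         # URL cue
        int('class="post"' in html or "<article" in html),  # HTML cue
        int("comment" in text),                             # text cue
    )


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(samples, features_left):
    """Greedy decision tree over binary features; samples are (features, label) pairs."""
    labels = [y for _, y in samples]
    if len(set(labels)) == 1 or not features_left:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label

    def split_cost(i):
        parts = ([y for x, y in samples if x[i] == 0],
                 [y for x, y in samples if x[i] == 1])
        return sum(len(p) * gini(p) for p in parts if p)

    best = min(features_left, key=split_cost)
    rest = [i for i in features_left if i != best]
    left = [(x, y) for x, y in samples if x[best] == 0]
    right = [(x, y) for x, y in samples if x[best] == 1]
    if not left or not right:  # this feature does not separate anything
        return build_tree(samples, rest)
    return {"feature": best, 0: build_tree(left, rest), 1: build_tree(right, rest)}


def train(samples):
    return build_tree(samples, list(range(len(samples[0][0]))))


def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree[x[tree["feature"]]]
    return tree


def incremental_update(training, tree, batch, threshold=0.9):
    """If the score on a labeled batch is below the threshold, add the
    misrecognized pages to the training set and rebuild the tree."""
    wrong = [(x, y) for x, y in batch if predict(tree, x) != y]
    score = 1.0 - len(wrong) / len(batch)
    if score < threshold:
        training = training + wrong
        tree = train(training)
    return training, tree, score
```

A caller would build feature tuples with `extract_features`, train an initial tree with `train`, and then call `incremental_update` on each newly labeled batch. Because the features come from markup, URLs, and simple text cues, no word segmentation step is involved.
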
With blog pages as the experimental subject, the results show that the algorithm achieves high recognition precision and recognizes topical web pages effectively.

For search strategy, the dynamic, complex, and semi-structured nature of the web demands an efficient strategy. After analyzing the traditional authority-first and similarity-first strategies, the thesis proposes a new search strategy based on co-weighting multiple sources of information. Because web pages are diverse and flexible while time and resources are limited, gathering pages that are both topic-related and important is one of the core problems. Building on the hypertext classification algorithm above, the new strategy combines the predicted topic similarity of a page with its predicted importance, giving priority to topic similarity while also considering page importance. Experimental results show that the new strategy outperforms using authority-first or similarity-first alone and achieves a better harvest rate.

Based on the above work, the thesis designs and develops IL-Crawler, a blog-focused crawler built on .NET technology that supports distributed data gathering and incremental learning. Experimental results show that IL-Crawler gathers data with better precision.
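
One way to picture the co-weighting search strategy is a crawl frontier ordered by a weighted sum of the two predicted values. The abstract does not give the actual formula, so the linear combination, the weight `alpha`, and the `Frontier` class below are assumptions for illustration only. Keeping `alpha` above 0.5 reflects the stated priority of topic similarity over page importance.

```python
import heapq
import itertools


class Frontier:
    """Hypothetical crawl frontier that pops the URL with the highest combined score."""

    def __init__(self, alpha=0.7):
        if not 0.5 < alpha <= 1.0:
            raise ValueError("alpha must favor topic similarity (0.5 < alpha <= 1)")
        self.alpha = alpha
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker: first pushed, first popped

    def score(self, similarity, importance):
        # Assumed co-weighting: both inputs are predictions in [0, 1].
        return self.alpha * similarity + (1.0 - self.alpha) * importance

    def push(self, url, similarity, importance):
        if url in self._seen:  # never queue the same URL twice
            return
        self._seen.add(url)
        priority = -self.score(similarity, importance)  # heapq is a min-heap
        heapq.heappush(self._heap, (priority, next(self._order), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = Frontier(alpha=0.7)
frontier.push("http://example.com/a", similarity=0.9, importance=0.1)
frontier.push("http://example.com/b", similarity=0.2, importance=1.0)
frontier.push("http://example.com/c", similarity=0.7, importance=0.9)
crawl_order = [frontier.pop() for _ in range(len(frontier))]
```

With these made-up scores, page `c` is fetched first because it is both on topic and important, and page `b` comes last despite its high importance because its topic similarity is low. Setting `alpha` to 1.0 reduces the sketch to a pure similarity-first ordering.
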
Keywords/Search Tags: Focused crawler, Hypertext classification, Search strategy, Incremental learning, Decision tree