
Design And Implementation Of Crawler Technology For Topics

Posted on: 2010-11-10
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Tan
Full Text: PDF
GTID: 2178360275953318
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of Web information, effectively retrieving useful content from the Web has become difficult. Search engines play an important role in everyday information retrieval and have become an indispensable tool for searching the Internet; Yahoo, Google, MSN, and Baidu are among the most successful of the many commercial general-purpose search engines. As networks grow more complex, however, these general-purpose engines sometimes lose their focus. In recent years a variety of new search technologies has emerged, such as streaming-media search based on P2P technology, meta-search, and vertical search, and these have become hot research topics in the search field. The core work of this thesis is the study of a domain-based (focused) crawler. First, we analyze a large-scale search engine in depth, describe its working principles in detail, and weigh the advantages and disadvantages of several commonly used search strategies. We then analyze the difficulties of implementing a Web crawler from two aspects: the technical issues that any general-purpose search engine must resolve, and the inherent limitations of such engines themselves.
We then present an implementation architecture for a domain-related Web crawler. To address concurrency and the consumption of network bandwidth, we redesign the DNS resolver so that bandwidth is used efficiently and transmission latency is reduced. To crawl pages efficiently, support parallel crawling of the Web, and allow all components to communicate and work efficiently, our design introduces non-blocking socket technology.

URL scheduling plays a key role in the crawler's design. We propose a probabilistic scoring model, inspired by a set of rules, that gives our crawler a more intelligent routing function so that it keeps reaching pages on the topic the user has specified. Building on this probability-based model, we further propose a tunneling technique based on the best-first search strategy to overcome the topic drift that frequently occurs during crawling: when the crawler deviates from the original topic, it quickly stops expanding that URL and takes the URL at the head of the queue as the next starting point. For completeness, we also briefly describe the implementation of the crawler's other related technologies.

A text classifier is an indispensable component of a topic-focused crawler. The Bayesian classifier is simple in principle, is not overly complicated to implement compared with other classifiers, and performs well. This thesis proposes an improved Bayesian classification algorithm: whereas the general Bayesian classifier treats the probabilities of all words as equally important, we give greater weight to the words appearing in a page's title. Finally, we implement a prototype focused crawler and present experimental data, analyzing, testing, and comparing the merits and shortcomings of the algorithms.
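The best-first scheduling with tunneling described above can be sketched as a priority-queue frontier: higher-scoring URLs are fetched first, and links found on low-relevance pages are still followed for a bounded number of hops before the branch is abandoned. This is only an illustrative sketch, not the thesis implementation; the names (`Frontier`, `TUNNEL_DEPTH`, the `fetch`/`extract_links`/`relevance` callbacks) and the depth limit of 3 are assumptions.

```python
import heapq

class Frontier:
    """Best-first URL frontier: URLs with higher topic-relevance scores
    are popped first. Sketch only; not the thesis code."""
    def __init__(self):
        self._heap = []    # (negated score, url); heapq is a min-heap
        self._seen = set() # never enqueue the same URL twice

    def __len__(self):
        return len(self._heap)

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

TUNNEL_DEPTH = 3  # assumed bound on consecutive off-topic hops

def crawl(frontier, fetch, extract_links, relevance, threshold=0.5):
    """Best-first crawl with tunneling: links on pages below the relevance
    threshold are still followed for up to TUNNEL_DEPTH hops, after which
    the drifting branch is dropped and the head of the queue is taken as
    the next starting point."""
    off_hops = {}  # url -> consecutive off-topic hops leading to it
    while len(frontier) > 0:
        url, score = frontier.pop()
        page = fetch(url)
        hops = off_hops.get(url, 0)
        for link in extract_links(page):
            r = relevance(page, link)
            if r >= threshold:
                frontier.push(link, r)
                off_hops[link] = 0          # back on topic: reset the tunnel
            elif hops + 1 <= TUNNEL_DEPTH:
                frontier.push(link, r)      # tunnel through an off-topic page
                off_hops[link] = hops + 1
            # else: topic drift exceeded the tunnel depth; drop the link
```

The frontier's ordering is what makes the strategy "best-first": a newly discovered high-relevance link can jump ahead of older low-relevance ones.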
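The improved Bayesian classifier can likewise be sketched as a multinomial naive Bayes model in which title words simply count more than body words, both in training and in scoring. The class name, the weight of 3.0, and the labels 1 (on-topic) / 0 (off-topic) are assumptions for illustration, not the thesis's actual parameters.

```python
import math
from collections import defaultdict

TITLE_WEIGHT = 3.0  # assumed: a title word counts as this many occurrences

class TitleWeightedNB:
    """Multinomial naive Bayes for on-topic (1) / off-topic (0) pages,
    with extra weight on title words. Illustrative sketch only."""
    def __init__(self, title_weight=TITLE_WEIGHT):
        self.w = title_weight
        self.word_counts = {1: defaultdict(float), 0: defaultdict(float)}
        self.totals = {1: 0.0, 0: 0.0}   # weighted word totals per class
        self.doc_counts = {1: 0, 0: 0}
        self.vocab = set()

    def train(self, title_words, body_words, label):
        self.doc_counts[label] += 1
        for word in body_words:
            self.word_counts[label][word] += 1.0
            self.totals[label] += 1.0
            self.vocab.add(word)
        for word in title_words:             # title words weigh more
            self.word_counts[label][word] += self.w
            self.totals[label] += self.w
            self.vocab.add(word)

    def score(self, title_words, body_words, label):
        logp = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        v = len(self.vocab)
        weighted = [(t, self.w) for t in title_words] + \
                   [(b, 1.0) for b in body_words]
        for word, weight in weighted:
            # Laplace-smoothed likelihood; title words contribute w times
            p = (self.word_counts[label][word] + 1.0) / (self.totals[label] + v)
            logp += weight * math.log(p)
        return logp

    def predict(self, title_words, body_words):
        return max((1, 0), key=lambda c: self.score(title_words, body_words, c))
```

Because title words are inflated in both the counts and the scoring, a topical term in the title pulls the posterior toward the on-topic class more strongly than the same term in the body, which is the intent of the improvement described above.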
Keywords/Search Tags: crawler, text classification, probability model, search engine