
Research And Implementation Of Focus Crawling Spider Based On A. T. C And Optimized Hyperlink Chosen Strategy

Posted on: 2009-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: J Yin
Full Text: PDF
GTID: 2178360245988809
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of the Internet, the gap between the volume of web information and people's ability to obtain it keeps widening. The search engine, an emerging area of technology, is therefore becoming increasingly important, and the web spider, which supplies search engines with their data, is becoming more and more advanced.

In this thesis, the distribution characteristics of web pages are studied, together with the principles, strategies, structural composition, working model, and dispatching mechanism of web spiders. A web spider system for the Windows environment, the Focus Crawling Spider system, is implemented in C++.

Automatic text categorization is introduced into the Focus Crawling Spider system. The page topic distinguishing module is based on an algorithm that integrates the "Simple Vector Distance", "KNN", and "Naive Bayes" methods. In addition, an "Invasive Fish Search (IFS)" method is designed for the URL pruning module so that the spider can pass through "tunnels" more easily and crawl widely across the Internet.

The design and implementation of the function modules of the Focus Crawling Spider system are also discussed, including analysis of the system's running bottlenecks and solutions to them. Several new methods are introduced in the system. The Focus Crawling Spider system has been tested, with satisfactory results.
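The "tunnel" idea behind the URL pruning module can be illustrated with the depth budget of the classic fish-search algorithm. What follows is a minimal Python sketch of that classic scheme only. It is not the thesis's Invasive Fish Search method, whose details the abstract does not give, and it is not the author's C++ implementation. The names fish_search_crawl, get_links, is_relevant, GRAPH, and RELEVANT are hypothetical, and the in-memory link graph stands in for real page fetching and topic classification.

```python
from collections import deque

def fish_search_crawl(seed, get_links, is_relevant, initial_depth=2, max_pages=100):
    """Breadth-first crawl with a classic fish-search-style depth budget.

    Each queued URL carries a budget: the number of further irrelevant
    pages whose links may still be followed on that path. A relevant page
    resets the budget for its children; an irrelevant page reduces it.
    """
    queue = deque([(seed, initial_depth)])
    seen = {seed}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        # Relevant page: children get a fresh budget.
        # Irrelevant page: children inherit a budget reduced by one.
        child_depth = initial_depth if is_relevant(url) else depth - 1
        if child_depth < 0:
            continue  # budget exhausted: prune this page's outgoing links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, child_depth))
    return visited

# Hypothetical link graph: "c" is on-topic but sits behind two off-topic pages.
GRAPH = {
    "seed": ["a", "x"],
    "a": ["b"],
    "b": ["c"],
    "x": ["y"],
    "y": ["z"],
    "z": ["w"],
}
RELEVANT = {"seed", "c"}

order = fish_search_crawl(
    "seed",
    get_links=lambda url: GRAPH.get(url, []),
    is_relevant=lambda url: url in RELEVANT,
    initial_depth=2,
)
```

With initial_depth=2 the crawl passes through the off-topic pages "a" and "b" to reach "c", while the links of the off-topic page "z" are pruned, so "w" is never fetched. A smaller budget prunes earlier and can miss pages that lie behind a tunnel.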
Keywords/Search Tags:Search engines, Web spider, Focus Crawling, Automatic text categorization