With the rapid growth of the Internet, the gap between the volume of information on the web and people's ability to find it keeps widening. Search engines have therefore become an increasingly important technology, and web spiders, the data suppliers of search engines, are growing ever more sophisticated. This thesis studies the distributed nature of web pages and examines in depth the principles, strategies, architecture, working model, and dispatching mechanism of web spiders. On this basis, a focused web spider for the Windows environment, the Focus Crawling Spider system, is designed and implemented in C++.

Automatic text categorization is introduced into the Focus Crawling Spider system. The page topic distinguishing module is based on an algorithm that integrates the Simple Vector Distance, KNN, and Naive Bayes methods. In addition, an Invasive Fish Search (IFS) method is designed for the URL pruning module, so that the spider can pass through "tunnels" more easily and crawl more widely across the Internet. The design and implementation of the system's function modules are also discussed, together with extensive analysis of the spider system's running bottlenecks and their solutions. Several new methods are introduced in the Focus Crawling Spider system. The system has been tested and has obtained satisfactory results.
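The abstract only names the classification methods used by the page topic distinguishing module. As an illustration of the Simple Vector Distance component, the following is a minimal sketch of cosine-similarity scoring between a fetched page's term vector and a topic term vector; the type and function names, the weighting scheme, and the sample values are assumptions for illustration, and the thesis's actual combination of this score with KNN and Naive Bayes is not shown.

    #include <cmath>
    #include <map>
    #include <string>
    #include <iostream>

    // Illustrative sketch only: cosine similarity between a page's term vector
    // and a topic term vector, in the style of "Simple Vector Distance"
    // relevance scoring. The real topic distinguishing module also integrates
    // KNN and Naive Bayes; that combination is not reproduced here.
    using TermVector = std::map<std::string, double>;

    double CosineSimilarity(const TermVector& page, const TermVector& topic) {
        double dot = 0.0, pageNorm = 0.0, topicNorm = 0.0;
        for (const auto& [term, weight] : page) {
            pageNorm += weight * weight;
            auto it = topic.find(term);
            if (it != topic.end()) dot += weight * it->second;
        }
        for (const auto& [term, weight] : topic) topicNorm += weight * weight;
        if (pageNorm == 0.0 || topicNorm == 0.0) return 0.0;
        return dot / (std::sqrt(pageNorm) * std::sqrt(topicNorm));
    }

    int main() {
        // Hypothetical term weights for a fetched page and for the crawl topic.
        TermVector page  = {{"spider", 0.8}, {"crawl", 0.6}, {"search", 0.4}};
        TermVector topic = {{"spider", 1.0}, {"crawl", 0.9}, {"engine", 0.5}};
        std::cout << "relevance = " << CosineSimilarity(page, topic) << '\n';
        return 0;
    }

In a focused crawler, such a relevance score would typically feed the URL pruning stage, where a method like IFS decides how far to keep crawling through low-relevance "tunnel" pages before abandoning a path.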