Research And Implementation Of Focus Crawling Spider Based On A. T. C And Optimized Hyperlink Chosen Strategy

Posted on:2009-03-30

Degree:Master

Type:Thesis

Country:China

Candidate:J Yin

Full Text:PDF

GTID:2178360245988809

Subject:Computer software and theory

Abstract/Summary:

With the rapid growth of the Internet, the gap between the volume of web information and people's ability to obtain it keeps widening. The search engine, an emerging area of technology, has become correspondingly important, and the web spider, which supplies search engines with their data, is becoming more and more advanced.

In this thesis, the distribution characteristics of web pages are studied in depth, together with the principles, strategies, structural composition, working model, and dispatching mechanism of web spiders. A web spider system for the Windows environment, the Focus Crawling Spider system, is implemented in C++.

Automatic text categorization is introduced into the Focus Crawling Spider system. The page topic distinguishing module is based on an algorithm that integrates the "Simple Vector Distance", "KNN" and "Naive Bayes" methods. In addition, we have designed an "Invasive Fish Search (IFS)" method for the URL pruning module so that the spider system can pass through "tunnels" more easily and crawl widely across the Internet.

The design and implementation of the function modules in the Focus Crawling Spider system are also discussed, including extensive analysis of the system's running bottlenecks and solutions to them. Several new methods are introduced in the system. The Focus Crawling Spider system has been tested and obtained satisfactory results.
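As an illustration of two ideas named in the abstract, the sketch below scores a page against a topic with a simple vector distance (cosine similarity over term frequencies) and applies a depth-budget rule in the spirit of the classic Fish Search algorithm, which lets a crawler continue a few hops past off-topic pages ("tunnels"). It is a minimal sketch under stated assumptions, not the thesis's method: the combination with KNN and Naive Bayes and the details of IFS are not given in the abstract, Python is used only for brevity (the system itself was written in C++), and all function names, the threshold, and the initial depth are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_on_topic(page_text, topic_vector, threshold=0.2):
    """Hypothetical relevance test: similarity to the topic vector meets a threshold."""
    return cosine_similarity(term_vector(page_text), topic_vector) >= threshold


def child_depth(parent_depth, parent_on_topic, initial_depth=3):
    """Fish-search-style budget: an on-topic parent resets the budget for its
    links; an off-topic parent passes on one less than its own."""
    return initial_depth if parent_on_topic else parent_depth - 1


def may_expand(depth):
    """A page whose budget has run out is not expanded further (URL pruning)."""
    return depth > 0
```

With an initial depth of 3, a chain of off-topic pages is followed for a few hops before it is pruned, while reaching an on-topic page restores the full budget.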

Keywords/Search Tags:

Search engines, Web spider, Focus Crawling, Automatic text categorization

Related items

1	Classification System Based On The Theme Of Information Acquisition In The Pages
2	Research On Web Information Retrieval Technology Based On Text Categorization
3	Research And Implementation On Optimizing The Focus Spider Arithmetic Based On Grid Technology
4	Mongolia Web Spider, Text Encoding Recognition And Conversion Research
5	Research And Application On Web Crawling And Text Mining Technology
6	A Research On Automatic WEB Documents Extraction And Classification
7	The Study And Implementation Of Efficient And Stable Methods For Data Crawling In Vertical Search Engines
8	The Research And Implementation On The Spider Of The Vertical Search Engines Based On The Reinforcement Learning
9	The Theme Of The Search Engine Web Spider Search Strategy Study
10	Design And Implementation Of Commodity-Oriented Vertical Search System