
The Research And Implement Of Topic-focused Web Crawler Based On SVM Classification Algorithm

Posted on: 2010-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: S J Wu
GTID: 2178360275979938
Subject: Circuits and Systems
Abstract/Summary:
A web crawler downloads web pages for a search engine and is an integral part of it. Starting from a set of seed links, a general-purpose crawler tries to traverse every page on the internet. A topic-focused web crawler, in addition to downloading pages, analyzes links and page content. It serves topic-specific queries: rather than trying to cover the whole web, it collects the pages related to a given topic. Topic-focused crawling has become an active research area in web information mining and retrieval and is important to the information search industry. This thesis studies the application of the support vector machine (SVM) algorithm to topic-focused web crawling and covers the following aspects.

First, the principle of the SVM classification algorithm is studied. The thesis describes the mathematical representation of web pages and then presents an improved web page classification algorithm based on SVM: two-class SVM classifiers first identify the pages that belong to the topic-specific class, and these pages are then divided into several child classes using the vector space model (VSM). During construction of the SVM, a bias adjustment is introduced to improve the recall of the classification. The revised classification function requires only two-class classifiers to be computed, which greatly reduces the misclassification of web pages. Experiments show that the approach trains efficiently and achieves both high classification accuracy and high recall.

Second, the crawler's workflow and functional modules are redesigned around the algorithm and the crawling target of the topic-focused crawler.
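The two-stage classification described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not give the form of the bias adjustment, so the sketch assumes a simple additive shift of the SVM decision threshold, and all names and values (`WEIGHTS`, `BIAS`, `CENTROIDS`, `bias_shift`) are hypothetical. Pages are represented as term-frequency vectors, a linear decision function filters on-topic pages, and cosine similarity in the vector space model picks a child class.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a page as a term-frequency vector (a simple VSM)."""
    return Counter(text.lower().split())


def dot(u, v):
    """Sparse dot product of two term -> weight mappings."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is empty."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def is_on_topic(page_vec, weights, bias, bias_shift=0.0):
    """Linear SVM decision f(x) = w.x + b + bias_shift >= 0.

    Assumption: the bias adjustment is modeled as an additive shift.
    A positive shift accepts more pages, trading precision for recall.
    """
    return dot(weights, page_vec) + bias + bias_shift >= 0.0


def classify(text, weights, bias, centroids, bias_shift=0.0):
    """Stage 1: two-class topic filter. Stage 2: nearest child class."""
    vec = tf_vector(text)
    if not is_on_topic(vec, weights, bias, bias_shift):
        return None  # off-topic page
    return max(centroids, key=lambda name: cosine(vec, centroids[name]))


# Hypothetical model parameters, chosen by hand for illustration only.
WEIGHTS = {"crawler": 1.0, "svm": 1.0, "football": -1.0}
BIAS = -1.5
CENTROIDS = {
    "classification": tf_vector("svm kernel margin classifier"),
    "crawling": tf_vector("crawler link frontier download"),
}
```

With these toy parameters, a borderline page such as "svm kernel margin" is rejected when `bias_shift` is 0 and accepted when it is raised to 0.6, which shows how shifting the bias can raise recall.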
HTTP analysis, multithreading, and added value inspection techniques are used to implement Percaspider, a topic-focused web crawler based on the SVM topic classification algorithm. The crawler's overall functionality is tested, and the results are presented and analyzed. Experiments show that the new topic-focused crawler performs well in both download speed and accuracy, which supports its validity and practicality.
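As a rough illustration of how multithreaded downloading and topic filtering can fit together, the sketch below crawls level by level with a thread pool and follows links only from pages judged relevant. It is not Percaspider's design, which the abstract does not detail. The `fetch` and `is_relevant` callables are injected assumptions: a real crawler would issue HTTP requests and call a trained classifier, while here a hypothetical in-memory `TOY_WEB` stands in for the network.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(seeds, fetch, is_relevant, max_pages=100, workers=4):
    """Focused crawl: fetch each frontier level in parallel and expand
    links only from pages that the relevance test accepts."""
    seen = set(seeds)
    frontier = list(seeds)
    relevant = []
    fetched = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and fetched < max_pages:
            batch = frontier[: max_pages - fetched]
            frontier = []
            # pool.map returns results in input order, so the output
            # is deterministic even though downloads run concurrently.
            for url, (text, links) in zip(batch, pool.map(fetch, batch)):
                fetched += 1
                if not is_relevant(text):
                    continue  # prune: ignore links on off-topic pages
                relevant.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return relevant


# Hypothetical stand-in for the web: url -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("svm crawler overview", ["a", "b"]),
    "a": ("football scores", ["c"]),
    "b": ("svm kernel tricks", ["d", "seed"]),
    "c": ("svm page reachable only through an off-topic page", []),
    "d": ("crawler frontier design", []),
}


def fetch_toy(url):
    """Return (text, links) for a URL from the in-memory toy web."""
    return TOY_WEB[url]


def is_relevant_toy(text):
    """Keyword test standing in for the SVM topic classifier."""
    return "svm" in text or "crawler" in text
```

Calling `crawl(["seed"], fetch_toy, is_relevant_toy)` visits the seed, skips the off-topic page "a" together with its links, and therefore never downloads "c". Pruning in this way is what keeps a focused crawler from covering the whole web, at the cost of missing relevant pages that are reachable only through off-topic ones.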
Keywords/Search Tags: Topic-focused web crawler, SVM, Web pages classification, Classification function, Multithread technology