Research On Focused Crawler Based On SVM Classification Algorithm

Posted on: 2012-06-11    Degree: Master    Type: Thesis
Country: China    Candidate: Z W Li    Full Text: PDF
GTID: 2218330368983059    Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet, the information available on it is growing and diversifying. Locating the information users need quickly, accurately, and efficiently has therefore become a main goal of search engines. A general search engine has the advantage of gathering information from a wide range of sources, but because it covers so many fields, the information it provides lacks expertise and depth in any specific area. The focused search engine came into being to address this: it can provide a professional, accurate, and in-depth search service. This paper concentrates on the focused crawler and studies crawling strategies for effectively crawling topical web pages.

This paper reviews focused crawlers and analyzes the advantages and disadvantages of current crawling strategies, covering the main frameworks of the general crawler and the focused crawler, text-based heuristic crawling strategies, and evaluation methods based on web link structure.

Web pages are represented with the Vector Space Model. This paper studies the principles of the support vector method and the kernel method, and proposes a topic relevance prediction algorithm, based on context and link structure, that predicts the topic relevance of pages that have not yet been crawled.

For crawled web pages, SVM classification first filters out the pages that are not relevant to the topic. A topical subgraph is then constructed with the HITS algorithm, and authority pages or central (hub) pages are selected as seeds for the next round of crawling.

The crawling strategy is studied on TSE. This paper builds a focused crawler based on an SVM classifier that combines the topic relevance prediction algorithm based on context and link structure, the SVM classification algorithm, and the HITS algorithm. Experiments show that the proposed focused crawler crawls topic-relevant pages more effectively.
Keywords/Search Tags: vector space model, HITS, SVM, focused crawler
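
The seed-selection step described in the abstract (filter out irrelevant pages, build a topical subgraph, rank it with HITS) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `hits`, `normalize`, and `select_seeds`, the toy `web_graph`, and the `relevant` set are all hypothetical, and the `relevant` set simply stands in for the output of the SVM classifier, which is not reproduced here.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(graph, iterations=50):
    """Plain HITS power iteration.

    graph maps each page to the list of pages it links to.
    Returns (authority, hub) score dictionaries.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of pages it links to.
        new_hub = {n: sum(new_auth[d] for d in graph.get(n, []))
                   for n in nodes}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

def select_seeds(graph, relevant, k=2):
    """Restrict the graph to relevant pages, then rank them with HITS."""
    # Assumption: `relevant` is the set of pages the classifier kept.
    sub = {p: [q for q in graph.get(p, []) if q in relevant]
           for p in relevant}
    auth, hub = hits(sub)
    # Ties are broken by page name so the result is deterministic.
    top_auth = sorted(auth, key=lambda n: (-auth[n], n))[:k]
    top_hub = sorted(hub, key=lambda n: (-hub[n], n))[:k]
    return top_auth, top_hub

# Toy link graph (hypothetical); page "x" is treated as off-topic.
web_graph = {"a": ["c", "d"], "b": ["c", "d"], "c": ["d"], "d": [], "x": ["c"]}
relevant = {"a", "b", "c", "d"}
top_authorities, top_hubs = select_seeds(web_graph, relevant, k=2)
```

In this toy graph, page `d` receives the most in-links from relevant pages and so ranks as the top authority, while `a` and `b` link to both authorities and rank as the top hubs. Either list could serve as seeds for the next crawl round.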