URL Classifier Algorithm Based On Decision Tree And Platform Design Of Focused Crawler

Posted on: 2017-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: F B Jiang
Full Text: PDF
GTID: 2348330488963545
Subject: Computer application technology
Abstract/Summary:
The Internet has evolved into a very large knowledge base, and exploring, mining, and analyzing it is currently a popular field of application. Before knowledge can be extracted, the first step is to collect raw data. With a knowledge base this large, general search engines such as Google and Yahoo rarely give good results when a user needs high-quality pages on a user-defined topic. Moreover, high-quality pages on a given topic are usually not clustered together but scattered across the Internet, which makes obtaining high-quality raw data even harder. General search engine crawlers are usually designed around a breadth-first strategy: they crawl web pages in a certain hierarchical, level-by-level order. Their goal is to collect as many web pages as possible.

The difference between a general web crawler and a focused web crawler is that a focused crawler has a strategy that guides the direction of the crawl. Such strategies, which are based on link structure or on page content, are widely used in focused crawler design. Guided by its crawling strategy, a focused crawler can deliberately seek out pages on the user-defined topic, which helps reduce bandwidth usage. The accuracy with which a focused crawler retrieves pages on the user-defined topic is an important indicator of its performance.

First, this thesis studies in depth the basic principles and system architecture of focused crawlers, examines in detail the crawling strategies based on web link structure and on web page content, and compares the advantages, disadvantages, and usage scenarios of these algorithms.
It then describes web page text-processing techniques, including HTML document parsing based on DOM tree analysis and on regular expressions, word stemming, text representation with the vector space model, and text similarity calculation based on the vector space model.

Second, building on this study of the basic principles and architecture of focused crawlers, the thesis proposes a URL classification algorithm based on a decision tree. The algorithm classifies URLs using four text sources: the heading tags (<h1>, <h2>, <h3>), the page title, the link anchor text, and the link context. The similarity between the text of each of these four sources and the user-defined topic is used to build a decision tree, which then classifies the other links in the current web page. URLs judged relevant to the user-defined topic are placed in the prior crawling URL queue, and the rest in the delayed crawling URL queue. The delayed crawling queue is crawled only when the prior crawling queue is empty, which ensures high crawling accuracy to some extent and avoids the "tunneling through" problem of focused crawlers.

Finally, this thesis uses an open-source web crawler framework to implement the URL classification algorithm based on the decision tree. The results show that, compared with a traditional focused crawler implementation based on the Fish-Search algorithm, the proposed implementation increases crawling accuracy by about 5%-7%.
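
The pipeline described above (vector-space similarity between each link's four text sources and the topic, a tree-shaped relevance decision, and two crawl queues) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function and class names are hypothetical, stemming is omitted, and `is_relevant` is a hand-written stand-in with made-up thresholds, whereas the thesis builds its decision tree from the similarity values.

```python
import math
import re
from collections import Counter, deque

# The four text sources named in the abstract (dictionary keys are assumptions).
FEATURES = ("headings", "title", "anchor", "context")


def tf_vector(text):
    """Term-frequency vector over lowercased word tokens (no stemming here)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_features(topic, link):
    """Similarity of each text source of a link to the user-defined topic."""
    topic_vec = tf_vector(topic)
    return {name: cosine(topic_vec, tf_vector(link.get(name, "")))
            for name in FEATURES}


def is_relevant(features):
    """Hand-written stand-in for a learned decision tree.

    The split order and thresholds are hypothetical; a real tree would be
    induced from labeled links.
    """
    if features["anchor"] >= 0.3:
        return True
    if features["context"] >= 0.2:
        return features["title"] >= 0.1 or features["headings"] >= 0.1
    return False


class Frontier:
    """Two-queue URL frontier: prior queue first, delayed queue as fallback."""

    def __init__(self):
        self.prior = deque()
        self.delayed = deque()

    def add(self, url, relevant):
        (self.prior if relevant else self.delayed).append(url)

    def next_url(self):
        # The delayed queue is consulted only when the prior queue is empty.
        if self.prior:
            return self.prior.popleft()
        if self.delayed:
            return self.delayed.popleft()
        return None
```

In this sketch, a crawler loop would call `link_features` and `is_relevant` for each outgoing link of a fetched page, pass the result to `Frontier.add`, and take the next page to fetch from `Frontier.next_url`. Keeping irrelevant links in the delayed queue instead of discarding them means pages reachable only through off-topic links can still be visited later.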
Keywords/Search Tags: Focused Crawler, Decision Tree, URL Classifier, Crawling Policy