The Design And Implementation Of Internet Data Incremental Collection System

Posted on: 2016-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: Q H Meng
Full Text: PDF
GTID: 2298330467993007
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, the volume of Internet data is growing explosively. Portals, social media sites, and blogs produce a large number of new webpages and new data every day, and these data may contain a great deal of valuable information. Collecting these data incrementally and in a timely manner, and then analyzing them, is therefore of clear value. The main tool for incremental collection of Internet data is the incremental web crawler, and designing and implementing such a crawler is the first step toward obtaining that information.

The Internet contains many well-formatted webpages whose links are updated frequently. This kind of webpage is called an index-type webpage. Analyzing and collecting index-type webpages can improve the efficiency of an incremental web crawler and is very important for finding new information on the Internet. This paper designs and implements an incremental data collection system for index-type webpages. The system is built on Heritrix 3.1.1 and improves Heritrix's incremental crawling function. It also provides a set of interfaces for developing index-webpage collectors, so that developers can quickly add new data sources to the system.

First, this paper studies the technical principles of Heritrix and proposes a scheme to improve its functionality. It then designs an incremental strategy for index-type webpages based on their characteristics, and proposes solutions to problems that the crawler encounters at runtime. The chapters on overall and detailed system design describe the design and implementation of the Internet data incremental collection system in detail. After development was completed, extensive functional and performance testing was carried out and showed that the design goals were achieved. The system is now running stably and has incrementally collected a large amount of data, which verifies its availability and reliability. Finally, this paper summarizes the development of the system and points out its remaining problems and directions for future improvement.
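
The abstract does not give the details of the incremental strategy, so the following is only a minimal sketch of the general idea behind incremental collection from an index-type webpage: fetch the index page again, extract its links, and keep only those not collected before. It is written in Python with the standard library rather than against the Heritrix API, and the names `LinkExtractor`, `new_links`, and the in-memory `seen` set are hypothetical and are not taken from the thesis.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def new_links(index_url, html, seen):
    """Return absolute URLs on the index page that are not yet in `seen`.

    `seen` is a set of already collected URLs (an assumption of this
    sketch; a real system would persist it). It is updated in place.
    """
    parser = LinkExtractor()
    parser.feed(html)
    fresh = []
    for href in parser.links:
        url = urljoin(index_url, href)  # resolve relative links
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

Calling `new_links` on each revisit of the same index page with a shared `seen` set yields only the links that appeared since the previous visit, which are the candidates for collection in that round.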
Keywords/Search Tags: Web Crawler, Incremental Crawling, Heritrix, Index Webpage