
Web Crawler System Based On Chrome Extension

Posted on: 2017-04-18
Degree: Master
Type: Thesis
Country: China
Candidate: S P Wei
GTID: 2308330503953780
Subject: Software engineering
Abstract/Summary:
With the advent of the big data era, the volume of information on the network has been growing explosively; for instance, the number of posts published on Sina Weibo each day reaches 120 million. Yet amid this unprecedented wealth of information, it has become increasingly difficult for users to obtain the data they need. The scattered results returned by traditional search engines such as Baidu and Google no longer meet users' requirements; what users need, both in professional data analysis and in daily life, is effectively integrated data. Web crawling is one of the technologies used to integrate Internet data. However, the crawler technologies in common use today are difficult to develop with, unstable, and not user friendly, so they cannot meet users' needs. It is therefore valuable to develop a new crawler system that is easy to extend, highly stable, widely applicable, and user friendly.
This thesis first analyzed existing crawler systems, crawler technologies, and anti-crawler strategies used in China and abroad, as well as the reasons why web crawler systems are complex to implement, unstable, and unfriendly to users. Based on this analysis, a new crawler system built on a Chrome extension was created. Furthermore, to meet different user needs and exploit the advantages of the Internet, two kinds of information capture modules were proposed for the system: a personal-version information capture extension and a server-version information capture extension.
Finally, to support the high-concurrency demands that the personal-version information capture module places on the central server module, the central server module was built on the Netty framework and the database module used a master-slave database configuration. To make the central server module easier to extend as the number of requests grows, the thesis programmed to interfaces and introduced the Spring framework to manage the dependencies between the central server module and the category.
The crawler system designed and developed in this thesis is easy to develop for, easy to extend, and supports many types of web pages, including static pages, asynchronously loaded pages, and dynamic pages. The personal-version information capture module can also make the most of the Internet by using each crawler user to capture information. Results in the test environment show that all of the features above were successfully implemented and that the system performs much better than other current crawler systems in terms of user friendliness and capacity.
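The idea of programming to an interface and letting a container manage dependencies can be illustrated with a minimal sketch. It is not the thesis's implementation, which the abstract describes as based on Netty and Spring; it is written in TypeScript only to stay close to the extension's JavaScript side, and every name in it (RequestHandler, TaskRequestHandler, ResultUploadHandler, CentralServer, the "task" and "upload" request types) is a hypothetical chosen for illustration.

```typescript
// Illustrative sketch only: every name here is hypothetical, not taken from the thesis.

// The contract the server depends on. Concrete handlers can be added or
// swapped without changing the server code.
interface RequestHandler {
  handle(payload: string): string;
}

// Two example implementations (assumed request types, for illustration).
class TaskRequestHandler implements RequestHandler {
  handle(payload: string): string {
    return `task assigned for ${payload}`;
  }
}

class ResultUploadHandler implements RequestHandler {
  handle(payload: string): string {
    return `result stored: ${payload}`;
  }
}

// The server knows only the interface. Its dependencies are injected from
// outside, which is the role a container such as Spring plays in a Java system.
class CentralServer {
  private readonly handlers: Map<string, RequestHandler>;

  constructor(handlers: Map<string, RequestHandler>) {
    this.handlers = handlers;
  }

  // Look up the handler for a request type and delegate to it.
  dispatch(requestType: string, payload: string): string {
    const handler = this.handlers.get(requestType);
    if (handler === undefined) {
      throw new Error(`no handler registered for request type "${requestType}"`);
    }
    return handler.handle(payload);
  }
}

// Wiring is done in one place, standing in for container configuration.
const server = new CentralServer(
  new Map<string, RequestHandler>([
    ["task", new TaskRequestHandler()],
    ["upload", new ResultUploadHandler()],
  ]),
);

console.log(server.dispatch("task", "http://example.com/page"));
```

Because CentralServer never names a concrete handler class, supporting a new kind of request in this sketch means adding one class and one entry in the wiring, which is the kind of extensibility the abstract attributes to the interface-based design.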
Keywords/Search Tags: Crawler, Chrome Extension, JavaScript, Netty, Master-slave Database