Design And Implementation Of Website Text Data Acquisition System

Posted on: 2016-10-14
Degree: Master
Type: Thesis
Country: China
Candidate: D Tian
GTID: 2298330470955549
Subject: Software engineering
Abstract/Summary:
Internet public opinion monitoring systems are a product of the development of new media. They monitor the spread of online information in real time, enabling users to discover, understand, and track public opinion as early as possible, which in turn makes crime prevention possible. The web crawler, as one component of public opinion monitoring, determines how close to real time the monitoring can be. This thesis designs and implements a data acquisition system that crawls customized content from targeted sites by means of configurable website templates, and in this way provides a real-time data source.

The data acquisition system consists of two subsystems: the crawler resource allocation and monitoring platform, and the information crawling platform. The crawler resource allocation and monitoring platform is built on open-source JavaEE development frameworks such as Struts2 and Spring. Its hierarchical structure and modular design improve its productivity and extensibility. The information crawling platform draws on the framework of Heritrix, the open-source crawler from SourceForge, which has been redesigned and redeveloped to meet the demands of the product.

The crawler resource allocation and monitoring platform manages the crawl configuration, which includes sites, network channels, seeds, templates, and other configuration information. The platform can also test and verify the accuracy of the configured templates. It provides a dynamic chart of the crawled information, which makes it much more convenient for users to monitor the amount of information crawled. In addition, it can export the records of inaccurate templates and correct them. The information crawling platform is mainly concerned with crawling website information. It crawls the contents of a webpage in four steps: seed loading, webpage loading, webpage parsing, and data storage.

In the process of designing and developing the system, the author completed the following five tasks:
(1) Gathering user requirements and investigating the current state of the crawler industry, in order to determine the overall requirements of the system as well as the functional requirements of each template.
(2) Designing the overall structure of the system and dividing it into functional modules.
(3) Working out a solution for the functions of each module according to that division, and completing the design of modules including information configuration management, template testing, crawling records, acquisition of crawl seeds, HTML loading, template parsing, and data enqueueing.
(4) Implementing the functional modules according to a concrete plan.
(5) Testing the functions of the most important modules and checking the accuracy of acquisition.

This system, as a test version, can satisfy the basic needs of users. Nevertheless, it is not yet a competitive product in this industry. Future work needs to improve the automation of module configuration and the efficiency with which the crawler acquires information, so that the system becomes competitive and brings considerable profits to the company.
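
The four crawling steps named in the abstract (seed loading, webpage loading, webpage parsing, data storage) can be illustrated with a minimal sketch. This is not the thesis's implementation, which is Java-based and derived from Heritrix. It is a Python illustration under stated assumptions: the `Template` class, the regex-per-field template format, the `crawl` function, and the in-memory `PAGES` dictionary standing in for real HTTP requests are all hypothetical. Seeds whose pages do not match the template are collected separately, loosely mirroring the abstract's records of inaccurate templates.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Template:
    """Hypothetical site template: one regex with a single capture group per field."""
    fields: Dict[str, str]

    def parse(self, html: str) -> Dict[str, Optional[str]]:
        record: Dict[str, Optional[str]] = {}
        for name, pattern in self.fields.items():
            match = re.search(pattern, html, flags=re.DOTALL)
            # A field that does not match stays None so the template can be flagged.
            record[name] = match.group(1).strip() if match else None
        return record


def crawl(
    seeds: List[str],
    fetch: Callable[[str], str],
    template: Template,
    store: List[dict],
) -> List[str]:
    """Run the four steps for every seed; return the seeds the template failed on."""
    failed: List[str] = []
    for url in seeds:                      # step 1: seed loading
        html = fetch(url)                  # step 2: webpage loading
        record = template.parse(html)      # step 3: webpage parsing
        if any(value is None for value in record.values()):
            failed.append(url)             # candidate "inaccurate template" record
            continue
        store.append({"url": url, **record})  # step 4: data storage
    return failed


# Assumed in-memory pages replace network access so the sketch runs offline.
PAGES = {
    "http://example.com/a": "<html><h1>Title A</h1><div class='body'>Text A</div></html>",
    "http://example.com/b": "<html><p>no matching markup</p></html>",
}

template = Template({
    "title": r"<h1>(.*?)</h1>",
    "body": r"<div class='body'>(.*?)</div>",
})

stored: List[dict] = []
failed = crawl(list(PAGES), PAGES.__getitem__, template, stored)
```

Passing the page loader in as a function keeps the parsing and storage steps testable without a network. A real crawler would more likely use an HTML parser with selectors than regular expressions, and a queue or database rather than a list.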
Keywords/Search Tags: Public Opinion Monitoring, The crawler, JavaEE