Design And Implementation Of Data Processing And Analysis System Based On Spark

Posted on: 2016-11-07
Degree: Master
Type: Thesis
Country: China
Candidate: S Li
Full Text: PDF
GTID: 2308330470455538
Subject: Software engineering
Abstract/Summary:
With the rapid growth of computer and information technology, application systems have expanded dramatically, and the volume of data they generate has grown explosively. Effective technologies and methods for processing and analyzing this data are urgently needed. Hadoop and Spark are currently two of the most popular distributed computing frameworks. To meet their business requirements and improve the quality of their products, a growing number of companies and institutions have begun to learn and use these two technologies as they have gradually matured. In this context, a company proposed building a data processing and analysis system based on Spark, namely ATL-DPAS (data processing and analysis system of acn technology lab). The system is compatible with Hadoop clusters, and it can also flexibly process large volumes of data and support real-time remote queries and visual analysis using the existing computing resources.
According to its functional requirements, the system is divided into three modules: a data processing module, a data query module, and a data modeling module. I was primarily responsible for the design and development of the data processing module, including the workflow design, the design and implementation of the data cleaning and merging algorithms, and the implementation of a variety of data processing interfaces. This thesis first reviews the research status relevant to the ATL data processing and analysis system and introduces the related theory and technology. Second, to clarify the aims of the system, it presents the system requirements analysis, covering both functional and non-functional requirements, followed by the system architecture design and the database design. Third, it presents the flow design, code implementation, and interface display of the data processing module, which comprises the HDFS list, data adding, data cleaning, data combining, and data type management.
Finally, the thesis describes the system deployment process and the testing approach, details the functional and performance tests of the data processing module, and evaluates the test results. These results confirm the strong performance of Spark and demonstrate the validity and practicality of the work.
The system is currently in trial operation and processes and analyzes hundreds of gigabytes of data every day. The operating results show that the system runs stably with excellent performance and achieves the desired objectives.
Keywords/Search Tags: Big data, Data processing, Spark, HDFS, Hive