Design and Implementation of a Real-Time Data Analysis and Processing System Based on Spark

Posted on: 2019-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: P Chen
Full Text: PDF
GTID: 2348330563953970
Subject: Computer application technology
Abstract/Summary:
With the rapid development of mobile Internet and Internet of Things technologies, people obtain more and more information from the Internet. At the same time, more and more information is transmitted to the network, and huge amounts of data are generated every moment. With distributed technology increasingly mature, the storage and management of large amounts of data have been largely solved by distributed file systems, and technologies such as Hadoop and HBase can meet the demands of most business scenarios for searching and retrieving information in massive data. However, the pursuit of progress does not stop there: more real-time data processing and analysis has become a common demand in all fields. A great deal of valuable knowledge and latent patterns are hidden in data, and the value of that data decreases with the passage of time. How to receive and manage this data effectively, analyze it quickly, mine the information behind it, and support real-time statistics, forecasting, and decision-making has become a major development opportunity and research hotspot. There is an urgent need for an efficient, fast, stable, high-throughput real-time analysis and processing system that can perform real-time, accurate statistical analysis of data from various data sources.

The types of big data are also becoming more and more complex. The popular solution in industry is to design and develop a different processing subsystem for each business scenario and data type: for example, Storm for real-time stream analysis, Hadoop for offline data analysis, and separate modules for machine learning. These subsystems are then organized into a large enterprise system through techniques such as message queues and caching. Although this approach works in production practice, its learning and research costs are high, and it lacks a unified computing platform, which makes it difficult for developers to maintain and extend a system composed of so many technology stacks.

To address these needs and problems, this thesis designs a general-purpose real-time data processing system based on Spark, which mainly comprises a new ETL module and a real-time processing engine module. It aims to provide real-time data processing that outperforms traditional Hadoop MapReduce technology, and it can collect data from heterogeneous data sources and compute on it quickly. It is also general-purpose and stable: it integrates real-time stream computing, fast batch computing, and machine learning, bringing various types of data computation together. In addition, the design draws on ideas from many excellent technologies in the big data ecosystem, such as Kafka and Redis. Developers only need to work with a single technical framework and a simple data flow to implement real-time data processing services easily, which reduces system complexity and maintenance burden. The design also addresses scalability and includes optimization strategies for data skew. For system construction and deployment, the system is built on Docker container technology and Kubernetes container orchestration, so that system clusters offer elastic scalability, high resource utilization, resource monitoring, rapid deployment, and portability.
Keywords/Search Tags:Big Data, Generic, Spark, ETL, Real-time processing, Extensible, Docker, Kubernetes