The Research Of Big Data Manipulating Technology Based On Spark

Posted on: 2017-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Wang
Full Text: PDF
GTID: 2428330590968339
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet technology, the world has entered the big data era, in which data exhibits several distinct characteristics: explosive growth in volume, variety in type, low value density, and strict timeliness requirements in processing. A large number of open-source big data projects have sprung up since Google published its papers on MapReduce, BigTable and GFS. As these technologies have matured, an ecosystem has formed and continues to flourish. Big data technologies can broadly be classified into four categories: data collection, data integration, data storage and data analysis. Data collection technologies gather data from scattered sources; data integration technologies consolidate data from multiple collection systems, cleaning and filtering it along the way; data storage technologies persist data to disk; and data analysis technologies extract information from it.

Log analysis is a typical application of big data processing. Logs matter to every company: they record not only user actions but also program exceptions, program performance and system running status. By analyzing logs, enterprises can extract commercial value from user behavior, locate program errors and performance bottlenecks, and discover potential system risks. At present, batch processing is still the dominant way to process logs, which sacrifices timeliness and makes applications such as system alerting, intrusion detection and risk analysis impractical.

Based on in-depth research into the key technologies of big data collection, integration and processing, this thesis contributes a real-time log collection and analysis platform based on Spark that meets the requirements of log processing and analysis. The platform is divided into three layers:

1. The data collection layer collects data from servers. A two-tier collection model ensures the transparency, fault tolerance and scalability of the layer; a weighted round-robin algorithm (sketched below) improves load balance; and a distributed configuration management service simplifies the management of collection clients.
2. The data integration layer integrates data from scattered sources and shields the analysis layer from data bursts. With a new topic allocation algorithm, the nodes of the integration layer achieve better load balance.
3. The data analysis layer processes real-time data with Spark and provides data search and data processing features. Dynamic SQL (see the sketch below) makes it possible to change the SQL text without resubmitting the Spark job, and a cache system built on the streaming data model reduces data-fetch time, speeding up jobs by 10% to 20%.
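The weighted round-robin selection in the collection layer can be illustrated with a short Scala sketch. The thesis does not publish its implementation, so the Node and WeightedRoundRobin names, the static per-node weights, and the use of the "smooth" weighted round-robin variant are assumptions; the sketch only shows how weights skew the pick order so that heavier collector nodes receive proportionally more log batches.

    // Minimal sketch, assuming each collector node advertises a static weight.
    final case class Node(host: String, weight: Int)

    final class WeightedRoundRobin(nodes: Seq[Node]) {
      // Smooth weighted round-robin: every pick adds each node's weight to its
      // running score, chooses the highest score, then subtracts the total
      // weight from the winner so the sequence stays evenly interleaved.
      private val totalWeight = nodes.map(_.weight).sum
      private val current = scala.collection.mutable.Map(nodes.map(n => n -> 0): _*)

      def next(): Node = synchronized {
        nodes.foreach(n => current(n) += n.weight)
        val winner = nodes.maxBy(current)   // node with the highest running score
        current(winner) -= totalWeight
        winner
      }
    }

    // Example: log senders pick a target collector for each batch.
    object Demo extends App {
      val wrr = new WeightedRoundRobin(Seq(Node("collector-1", 5), Node("collector-2", 1)))
      (1 to 6).foreach(_ => println(wrr.next().host))
    }

Out of every six picks, "collector-1" is chosen five times and "collector-2" once, which is the kind of load-balancing behaviour the abstract attributes to the collection layer.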
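The dynamic-SQL idea in the analysis layer can be sketched as well. This is an assumption-laden illustration: loadCurrentSql(), the file path, the socket source, the local master and the 5-second batch interval are all hypothetical; the point is only that the SQL text is re-read on every micro-batch, so analysts can change the query without resubmitting the Spark Streaming job.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DynamicSqlJob {
      // Assumption: the active SQL text lives in an external store (here a
      // shared file) and can be re-read by the driver on every micro-batch.
      def loadCurrentSql(): String =
        scala.io.Source.fromFile("/etc/log-platform/current.sql").mkString

      def main(args: Array[String]): Unit = {
        // Local master keeps the sketch self-contained; a real deployment
        // would submit to a cluster.
        val conf  = new SparkConf().setAppName("dynamic-sql-log-analysis").setMaster("local[2]")
        val ssc   = new StreamingContext(conf, Seconds(5))
        val spark = SparkSession.builder.config(conf).getOrCreate()
        import spark.implicits._

        // The real platform feeds the stream from the integration layer (Kafka);
        // a socket source stands in for it here.
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { rdd =>
          // Expose each micro-batch as a temporary view and re-load the SQL,
          // so the query can change while the job keeps running.
          rdd.toDF("line").createOrReplaceTempView("logs")
          spark.sql(loadCurrentSql()).show()
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because the query string is fetched anew for each batch, editing current.sql takes effect on the next micro-batch, which matches the abstract's claim that the SQL content can change without resubmitting the Spark job.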
Keywords/Search Tags:big data, log, data collecting, data integrating, Spark, real-time processing