The Research Of Big Data Manipulating Technology Based On Spark

Posted on: 2017-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Wang
Full Text: PDF
GTID: 2428330590968339
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet technology, the world has entered the big data era, in which data exhibits several distinct characteristics: explosive growth in volume, variety in type, low value density, and strict timeliness requirements in processing. A large number of open-source big data projects have sprung up since Google published its papers on MapReduce, BigTable and GFS. As these technologies have matured, an ecosystem has formed and continues to flourish. Big data technologies can broadly be classified into four categories: data collection, data integration, data storage and data analysis. Data collection technologies gather data from scattered sources; data integration technologies consolidate data from multiple collection systems, cleaning and filtering it along the way; data storage technologies persist data to disk; and data analysis technologies extract information from it.

Log analysis is a typical application of big data processing. Logs matter to every company: they record not only user actions but also program exceptions, program performance and system running status. By analyzing logs, enterprises can extract commercial value from user behavior, locate program errors and performance bottlenecks, and discover potential system risks. At present, batch processing is still the dominant way to process logs, which sacrifices timeliness and makes applications such as system alerting, intrusion detection and risk analysis impractical.

Based on in-depth research into the key technologies of big data collection, integration and processing, this thesis contributes a real-time log collection and analysis platform based on Spark that meets the requirements of log processing and analysis. The platform is divided into three layers:

1. The data collection layer collects data from servers. A two-tier collection model ensures the transparency, fault tolerance and scalability of the layer; a weighted round-robin algorithm (sketched below) improves load balance; and a distributed configuration management service simplifies the management of collection clients.
2. The data integration layer integrates data from scattered sources and shields the analysis layer from data bursts. With a new topic allocation algorithm, the nodes of the integration layer achieve better load balance.
3. The data analysis layer processes real-time data with Spark and provides data search and data processing features. Dynamic SQL (see the sketch below) makes it possible to change the SQL text without resubmitting the Spark job, and a cache system built on the streaming data model reduces data-fetch time, speeding up jobs by 10% to 20%.
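The weighted round-robin selection in the collection layer can be illustrated with a short Scala sketch. The thesis does not publish its implementation, so the Node and WeightedRoundRobin names, the static per-node weights, and the use of the "smooth" weighted round-robin variant are assumptions; the sketch only shows how weights skew the pick order so that heavier collector nodes receive proportionally more log batches.

    // Minimal sketch, assuming each collector node advertises a static weight.
    final case class Node(host: String, weight: Int)

    final class WeightedRoundRobin(nodes: Seq[Node]) {
      // Smooth weighted round-robin: every pick adds each node's weight to its
      // running score, chooses the highest score, then subtracts the total
      // weight from the winner so the sequence stays evenly interleaved.
      private val totalWeight = nodes.map(_.weight).sum
      private val current = scala.collection.mutable.Map(nodes.map(n => n -> 0): _*)

      def next(): Node = synchronized {
        nodes.foreach(n => current(n) += n.weight)
        val winner = nodes.maxBy(current)   // node with the highest running score
        current(winner) -= totalWeight
        winner
      }
    }

    // Example: log senders pick a target collector for each batch.
    object Demo extends App {
      val wrr = new WeightedRoundRobin(Seq(Node("collector-1", 5), Node("collector-2", 1)))
      (1 to 6).foreach(_ => println(wrr.next().host))
    }

Out of every six picks, "collector-1" is chosen five times and "collector-2" once, which is the kind of load-balancing behaviour the abstract attributes to the collection layer.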
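The dynamic-SQL idea in the analysis layer can be sketched as well. This is an assumption-laden illustration: loadCurrentSql(), the file path, the socket source, the local master and the 5-second batch interval are all hypothetical; the point is only that the SQL text is re-read on every micro-batch, so analysts can change the query without resubmitting the Spark Streaming job.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DynamicSqlJob {
      // Assumption: the active SQL text lives in an external store (here a
      // shared file) and can be re-read by the driver on every micro-batch.
      def loadCurrentSql(): String =
        scala.io.Source.fromFile("/etc/log-platform/current.sql").mkString

      def main(args: Array[String]): Unit = {
        // Local master keeps the sketch self-contained; a real deployment
        // would submit to a cluster.
        val conf  = new SparkConf().setAppName("dynamic-sql-log-analysis").setMaster("local[2]")
        val ssc   = new StreamingContext(conf, Seconds(5))
        val spark = SparkSession.builder.config(conf).getOrCreate()
        import spark.implicits._

        // The real platform feeds the stream from the integration layer (Kafka);
        // a socket source stands in for it here.
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { rdd =>
          // Expose each micro-batch as a temporary view and re-load the SQL,
          // so the query can change while the job keeps running.
          rdd.toDF("line").createOrReplaceTempView("logs")
          spark.sql(loadCurrentSql()).show()
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because the query string is fetched anew for each batch, editing current.sql takes effect on the next micro-batch, which matches the abstract's claim that the SQL content can change without resubmitting the Spark job.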
Keywords/Search Tags:big data, log, data collecting, data integrating, Spark, real-time processing