Research And Application On Dynamic Compression Technique On In-Memory Column Oriented Dataset

Posted on:2017-01-06

Degree:Master

Type:Thesis

Country:China

Candidate:Z P Jiang

Full Text:PDF

GTID:2428330590988889

Subject:Software engineering

Abstract/Summary:

With the growing adoption of information technology, a single local machine can no longer process large data sets in a reasonable time. Hadoop was developed to address this problem by bringing distributed computing into production environments; it introduced MapReduce for computation and HDFS for storage. However, as Moore's Law has slowed, hard disk performance has not improved significantly in the past few years, even though the average volume of data generated on the Internet has grown massively. Under these circumstances, researchers at the University of California, Berkeley proposed in-memory computing and developed Spark, which has been highly successful for large-scale, data-intensive applications, especially those that reuse data across multiple parallel operations. Because memory is still costly, however, a better way of managing memory resources is needed.

In this thesis, we present an elastic data persistence solution for column-oriented data sets in Spark, which compresses data to save Java Virtual Machine heap space and reduces disk I/O for faster data access. We tested three common compression algorithms, identified the data types each is suited to, and then mathematically derived the criteria for selecting the optimal data compression and persistence plan. Column-oriented data sets are particularly convenient to compress. Based on the hypothesis and test results, we designed and implemented a data compression module for Spark that enables dynamic data compression for both Spark RDDs and DataFrames. Our evaluation of the preliminary prototype shows that it can provide resource management recommendations by accounting for input data type, memory space, and CPU resources, and that it consistently yields high performance, accelerating Spark by up to 6x.

To test the performance of the in-memory computing system, we developed a real-time big data analysis system for server logs, which provides a message queue service for log aggregation, a dynamic compression plan for column-oriented data sets, and an SQL query interface for in-memory data sets.

This research has three main highlights. First, it introduces the concept of dynamic data compression, which reduces the work of application developers: they no longer need to tune their applications by repeatedly running tests and changing configurations. Second, it enables users to run SQL queries on both real-time and offline data. Last but not least, by integrating other popular big data analysis tools into the tool chain, we developed a log analysis framework for big data, which not only puts our research into practice but also demonstrates the feasibility of compression while preserving performance.
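The per-column selection step described in the abstract (try several compression algorithms, then pick a plan for each column) might be sketched as follows. This is only an illustration under stated assumptions, not the thesis's actual module: the abstract does not name the three algorithms or the derived criteria, so `zlib`, `bz2`, and `lzma` from the Python standard library stand in for the codecs, and the `min_saving` threshold, `encode_column`, and `choose_plan` are hypothetical names introduced here.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, List, Tuple

# Stand-in codecs (assumption): the abstract does not name the three
# algorithms that were tested.
CODECS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def encode_column(values: List[object]) -> bytes:
    """Serialize one column as newline-separated UTF-8 text."""
    return "\n".join(str(v) for v in values).encode("utf-8")


def choose_plan(values: List[object], min_saving: float = 0.2) -> Tuple[str, int]:
    """Return (codec name, stored size in bytes) for one column.

    Hypothetical rule: use the codec with the smallest output, but only
    if it saves at least `min_saving` of the raw size; otherwise store
    the column uncompressed ("none").
    """
    raw = encode_column(values)
    best_name, best_size = "none", len(raw)
    for name, compress in CODECS.items():
        size = len(compress(raw))
        if size < best_size:
            best_name, best_size = name, size
    # Reject a codec whose saving is too small to justify the CPU cost.
    if best_name != "none" and best_size > (1 - min_saving) * len(raw):
        return "none", len(raw)
    return best_name, best_size


if __name__ == "__main__":
    log_paths = ["GET /index.html"] * 1000  # repetitive, log-like column
    print(choose_plan(log_paths))
    print(choose_plan([1]))  # too small to benefit from compression
```

A fuller model along the lines the abstract suggests would also weigh decompression CPU time and available memory, which this size-only sketch deliberately leaves out.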

Keywords/Search Tags:

In-Memory Computing, Column Data Set, Data Compression, Apache Spark, Real-Time Streaming Analysis

Related items

1	The Enterprise Data Real-Time Analysis System Based On In-Memory Database
2	A System For Distributed MD Data Analysis Based On Spark
3	Application Research Of Real-time Data Analysis Based On Spark Computing
4	The Design And Implementation Of The Traffic Flow Data Real Time Processing System Based On Apache S4
5	Design And Implementation Of A Streaming SQL Real-time Computing Platform Based On Apache Flink
6	Real-Time Analysis Of Enterprise Financial Data Based On Memory Computing Technology
7	Real-time Detecting Of DDoS Attacks Based On Spark-streaming
8	Design And Implementation Of Data Real-time Analysis And Processing System Based On Spark
9	Research And System Implementation For Search Data Analysis Based On Streaming Computing
10	Research On The Performance Modeling Of Spark Streaming