Scalable Big Data Analysis Platform Based On PostgreSQL And Spark

Posted on:2017-05-05

Degree:Master

Type:Thesis

Country:China

Candidate:M Cheng

Full Text:PDF

GTID:2348330503479039

Subject:Computer technology

Abstract/Summary:

With the development of big data, the traditional fields of data warehousing and data analysis are undergoing profound technological change, and new data analysis platforms are emerging. Hadoop, a platform for analyzing unstructured data that excels at batch processing, is becoming a standard platform for large-scale data processing. Spark is a distributed computing engine that is compatible with Hadoop. Its in-memory computing model gives it a leap in performance over Hadoop, and Spark has become a standard tool for running machine learning algorithms on large datasets. These new platforms offer a wider choice of data analysis tools, but recent survey reports show that SQL queries are still the main mode of data analysis in most businesses. At the same time, the growing scale of data is driving demand for deeper analysis. How can the analysis capabilities of a relational database be enhanced while keeping SQL? The two commonly used solutions today are MPP analytic databases and relational databases that coexist with other analysis systems, but both suffer from vertical scaling limitations and management issues. This paper proposes a scalable big data analysis platform based on PostgreSQL and Spark, or PSS for short. It loosely couples the ease of operation of PostgreSQL with the computing power of Spark, providing powerful distributed computing and machine learning capabilities while retaining the ease of operation and SQL analysis capabilities of a relational database. To connect the two heterogeneous platforms in a loosely coupled way, this paper presents the Dex middleware, based on the Thrift framework, which communicates upward with the UDFs of PostgreSQL and downward with the Spark cluster. For cross-platform data transfer, this paper proposes the Dex RDD scheme, implemented by modifying the core source code of Spark, which avoids a large amount of disk I/O.
PSS is easy to operate: the user only needs to execute SQL extension functions in the psql client to call the algorithm models on the Spark cluster. Experiments show that the PSS platform has good accuracy, efficiency, and scalability. Its extensibility comes from the physical isolation between data storage and algorithm models, both of which can be extended. Users can add custom algorithm models suited to the characteristics of their data source. This paper also implements a real-time traffic prediction system based on the PSS platform.
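
The loose coupling described above can be illustrated with a minimal sketch. It is not the thesis implementation: it uses neither Thrift nor Spark, and every name in it (`Middleware`, `register`, `udf_call_model`, `row_mean`) is a hypothetical stand-in chosen for illustration. It only shows the idea that a database-side function forwards a named request to a middleware, which dispatches it to a separately registered algorithm model, so that models can be added without changing the database side.

```python
from typing import Callable, Dict, List, Sequence

Row = Sequence[float]
Model = Callable[[List[Row]], List[float]]


class Middleware:
    """Hypothetical stand-in for a middleware that routes named requests to models."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def register(self, name: str, model: Model) -> None:
        # New models are added here; the database-side function is untouched.
        self._models[name] = model

    def call(self, name: str, rows: List[Row]) -> List[float]:
        if name not in self._models:
            raise KeyError(f"unknown model: {name}")
        return self._models[name](rows)


def row_mean(rows: List[Row]) -> List[float]:
    # Toy "algorithm model" (an assumption for the example): mean of each row.
    return [sum(r) / len(r) for r in rows]


def udf_call_model(mw: Middleware, name: str, rows: List[Row]) -> List[float]:
    # Stand-in for the SQL extension function a user would run from psql.
    return mw.call(name, rows)


middleware = Middleware()
middleware.register("row_mean", row_mean)
result = udf_call_model(middleware, "row_mean", [(1.0, 3.0), (2.0, 4.0, 6.0)])
```

In this sketch the caller only knows a model name and passes rows, which mirrors the separation between data storage and algorithm models that the abstract credits for the platform's extensibility.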

Keywords/Search Tags:

PostgreSQL, UDF extension, Spark, In-memory computing, Dex middleware

Related items

1	Adaptive Memory Management Research Based On In-Memory Computing Characteristics In Spark
2	Research On Workload-specific Memory Configuration Of Spark Workloads
3	Research On Memory Optimization Technology Of Spark Computing Engine
4	Research And Implementation Of Memory Optimization Based On Parallel Computing Engine Spark
5	On The Low Overhead Configuration Optimization Of In-memory Big Data Query Engine
6	Research On Memory Data Management Technology In Spark
7	Research On Spark Performance Optimization Technology For In-Memory Computing
8	Research On Memory Management And Cache Replacement Policies In Spark
9	Research On Significant Technologies Of Performance Optimization On In-memory Computing Framework
10	Research On Key Techniques Of Hybrid Memory Management For Big-Data Application