
Scalable Big Data Analysis Platform Based On PostgreSQL And Spark

Posted on: 2017-05-05 | Degree: Master | Type: Thesis
Country: China | Candidate: M Cheng
GTID: 2348330503479039 | Subject: Computer technology
Abstract/Summary:
With the development of big data, the traditional data warehouse and data analysis fields are undergoing profound technological change, and new data analysis platforms are steadily emerging. Hadoop, an unstructured data analysis platform well suited to batch processing, is becoming a standard platform for large-scale data processing. Spark is a distributed computing engine that is compatible with Hadoop. Its in-memory computing model gives it a leap in performance over Hadoop, and Spark is currently the standard tool for running machine learning algorithms on large datasets. These new platforms offer more choice of data analysis tools, but a recent survey report shows that SQL queries are still the main mode of data analysis in most businesses and companies. At the same time, the growing scale of data increases the demand for deeper analysis. How can the analysis capabilities of a relational database be enhanced while retaining SQL? The commonly used solutions today are MPP analytic databases and relational databases that coexist with other analysis systems, but both suffer from vertical scaling limitations and management issues. This paper first proposes a scalable big data analysis platform based on PostgreSQL and Spark, or PSS for short. It loosely couples the ease of operation of PostgreSQL with the computing power of Spark, providing powerful distributed computing and machine learning capabilities while maintaining the ease of operation and SQL analysis capabilities of a relational database. To loosely couple the two heterogeneous platforms, this paper presents the Dex middleware, built on the Thrift framework, which communicates upward with the UDFs of PostgreSQL and downward with the Spark cluster. For cross-platform data transfer, this paper proposes the Dex RDD scheme, implemented by modifying the kernel source of Spark, which avoids a large amount of disk I/O.
PSS is easy to operate: the user only needs to execute SQL extension functions in the psql client to call the algorithm models in the Spark cluster. Experiments show that the PSS platform has good accuracy, efficiency, and scalability. Its extensibility comes from the physical isolation between data storage and algorithm models, both of which can be extended. Users can add custom algorithm models according to the characteristics of the data source. This paper implements a real-time traffic prediction system based on the PSS platform.
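
The call path described above (a SQL extension function forwards a request to a middleware, which dispatches to a named algorithm model) can be illustrated with a minimal in-process sketch. This is not the thesis's implementation: the names Middleware, run_model_udf, and moving_average are hypothetical, and the real Dex middleware communicates over Thrift with a Spark cluster rather than calling a local Python function. The sketch only shows how a registry keeps data access and algorithm models decoupled, so that a custom model can be added without touching the caller.

```python
from typing import Callable, Dict, Sequence

# A model takes a sequence of observations and returns one prediction.
Model = Callable[[Sequence[float]], float]


class Middleware:
    """Hypothetical stand-in for the Dex middleware: a registry of models.

    In PSS the request would travel over Thrift to a Spark cluster;
    here the dispatch happens in-process purely for illustration.
    """

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def register(self, name: str, model: Model) -> None:
        # Adding a custom model does not require changing any caller.
        self._models[name] = model

    def invoke(self, name: str, rows: Sequence[float]) -> float:
        if name not in self._models:
            raise KeyError(f"unknown model: {name}")
        return self._models[name](rows)


def moving_average(rows: Sequence[float]) -> float:
    """Toy 'prediction': the mean of the last three observations."""
    window = list(rows)[-3:]
    if not window:
        raise ValueError("at least one observation is required")
    return sum(window) / len(window)


def run_model_udf(mw: Middleware, model_name: str, rows: Sequence[float]) -> float:
    """Plays the role of the SQL extension function a user would call from psql."""
    return mw.invoke(model_name, rows)


middleware = Middleware()
middleware.register("moving_average", moving_average)

# Hypothetical traffic counts standing in for rows selected from a table.
prediction = run_model_udf(middleware, "moving_average", [10.0, 20.0, 30.0, 40.0])
print(prediction)
```

In this sketch the caller only knows a model name and the input rows, which mirrors the separation between data storage and algorithm models that the abstract credits for the platform's extensibility.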
Keywords/Search Tags: PostgreSQL, UDF extension, Spark, Memory computing, Dex Middleware