A Big Data Query Engine System Based On Dataflow

Posted on:2016-03-15

Degree:Master

Type:Thesis

Country:China

Candidate:G R Liang

Full Text:PDF

GTID:2308330461955238

Subject:Software engineering

Abstract/Summary:

The volume of data collected and analyzed at Baidu is growing rapidly. In 2009, Baidu adopted the popular open-source Apache Hadoop for distributed big data processing. Hadoop's MapReduce framework simplifies programming for large-scale distributed computation, but its programming model is low-level and cumbersome for complex processing logic. Moreover, most analysts are accustomed to using SQL for data analysis. Hence, Baidu needs a distributed query engine based on relational algebra for big data analysis.

Many query engines are currently used in industry, such as Apache Hive, Apache Spark SQL, and Google Dremel. These systems adopt SQL as their main query interface and provide good extensibility and performance. They compile SQL into query plans that run on an existing distributed computing framework, and they apply optimizations based on traditional RDBMS rules.

Although these query engines have many advantages, they cannot meet all of Baidu's analysis requirements. Baidu therefore decided to develop its own relational query engine on top of its internal distributed dataflow computing framework. The goal of the system is to provide reusable components for structured big data processing with high performance, high availability, and high extensibility. The system builds on ideas from classical compiler design and has four main layers. The first layer is the frontend, which parses SQL into an AST and performs semantic analysis. The second layer is the intermediate representation (IR) layer, which implements an IR language for representing relational operations. The third layer is the pass framework, which abstracts optimization rules into passes for logical plan analysis and transformation. The fourth layer is the runtime layer, which compiles IR operators into physical operators.

The system is in production. Its functionality covers 80% of big data batch processing at Baidu, and on average it runs 30% faster than Hive.
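
The four layers named in the abstract (frontend, IR, pass framework, runtime) can be illustrated with a toy pipeline. The following is only a minimal sketch and not the thesis's implementation: the node classes `Scan`, `Filter`, and `Project`, the `MergeFilters` pass, the tiny SQL subset, and the in-memory `catalog` and `tables` are all hypothetical names chosen for illustration, and the "runtime" here is a set of Python generators rather than operators on a distributed dataflow framework.

```python
import operator
import re
from dataclasses import dataclass

# IR layer (hypothetical): relational operators as immutable tree nodes.
@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    child: object
    preds: tuple  # conjunction of (column, op, constant) triples

@dataclass(frozen=True)
class Project:
    child: object
    columns: tuple

# Frontend layer: parse a tiny SQL subset and check names against a catalog.
_QUERY = re.compile(r"SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.+))?$", re.I)
_PRED = re.compile(r"(\w+)\s*(>=|<=|=|>|<)\s*(-?\d+)$")

def parse(sql, catalog):
    m = _QUERY.match(sql.strip())
    if m is None:
        raise ValueError("unsupported query: " + sql)
    columns = tuple(c.strip() for c in m.group(1).split(","))
    table = m.group(2)
    if table not in catalog:
        raise ValueError("unknown table: " + table)
    plan = Scan(table)
    preds = re.split(r"\s+AND\s+", m.group(3), flags=re.I) if m.group(3) else []
    for text in preds:
        p = _PRED.match(text.strip())
        if p is None:
            raise ValueError("unsupported predicate: " + text)
        if p.group(1) not in catalog[table]:
            raise ValueError("unknown column: " + p.group(1))
        # Emit one Filter per conjunct; the pass below fuses them.
        plan = Filter(plan, ((p.group(1), p.group(2), int(p.group(3))),))
    for c in columns:
        if c not in catalog[table]:
            raise ValueError("unknown column: " + c)
    return Project(plan, columns)

# Pass framework: each pass rewrites a logical plan into an equivalent one.
class MergeFilters:
    """Fuse Filter(Filter(x, p), q) into a single Filter(x, p + q)."""
    def run(self, node):
        if isinstance(node, Scan):
            return node
        child = self.run(node.child)
        if isinstance(node, Project):
            return Project(child, node.columns)
        if isinstance(child, Filter):
            return Filter(child.child, child.preds + node.preds)
        return Filter(child, node.preds)

def optimize(plan, passes):
    for p in passes:
        plan = p.run(plan)
    return plan

# Runtime layer: lower each IR operator to a Python generator over rows.
_OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt,
        ">=": operator.ge, "<=": operator.le}

def execute(node, tables):
    if isinstance(node, Scan):
        yield from tables[node.table]
    elif isinstance(node, Filter):
        for row in execute(node.child, tables):
            if all(_OPS[op](row[col], k) for col, op, k in node.preds):
                yield row
    else:  # Project
        for row in execute(node.child, tables):
            yield {c: row[c] for c in node.columns}

# Made-up sample data for the demonstration.
catalog = {"users": {"name", "age", "score"}}
tables = {"users": [
    {"name": "ann", "age": 41, "score": 7},
    {"name": "bob", "age": 25, "score": 9},
    {"name": "cy", "age": 35, "score": 2},
]}
query = "SELECT name FROM users WHERE age > 30 AND score >= 5"
plan = optimize(parse(query, catalog), [MergeFilters()])
print(list(execute(plan, tables)))  # prints [{'name': 'ann'}]
```

Keeping the IR as plain immutable trees means a pass is just a function from plan to plan, so new optimization rules can be added to the list given to `optimize` without touching the frontend or the runtime.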

Keywords/Search Tags:

Query Engine, Big Data, Dataflow, Distributed Computing

Related items

1	Research On Key Technologies Of Distributed Rank-aware Query Processing
2	Distributed Joins And Optimization For BIG Table Based On Database OceanBase
3	Hybrid Graph Query And Graph Computing Engine For Distributed Graph Database
4	Design And Implementation Of Acceleration Method For Massive Distributed In-Memory Database Query Engine
5	A Distributed Computing Framework to Manage, Query, and Analyze Big Geospatial Data for Urban Studies - Case Studies with Urban Heat Island and Tourist Movement Pattern Minin
6	Dataflow Runtime System On Heterogeneous Convergence Platform
7	Design And Implementation Of Distributed Graph Computing Engine
8	The Design And Implementation Of A Data Warehouse Engine Based On Hadoop
9	Research And Implementation Of Distributed Twig Query Processing Over Massive XML Documents In The Cloud
10	A dataflow-based software integration model in parallel and distributed computing and applications