A Big Data Query Engine System Based On Dataflow

Posted on: 2016-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: G R Liang
GTID: 2308330461955238
Subject: Software engineering
Abstract/Summary:
The volume of data collected and analyzed at Baidu is growing rapidly. In 2009, Baidu adopted the popular open source Apache Hadoop for distributed big data processing. Hadoop's MapReduce framework simplifies programming for large-scale distributed computation, but its programming model is low level and cumbersome for complex processing logic. Moreover, most analysts are used to doing data analysis in SQL.

Hence, Baidu needs a distributed query engine based on relational algebra for big data analysis.

Many query engine systems are currently used in industry, such as Apache Hive, Apache Spark SQL, and Google Dremel. These systems adopt SQL as their main query interface and provide good extensibility and performance. They compile SQL into a query plan that runs on an existing distributed computing framework, and they apply optimizations based on traditional RDBMS rules.

Although these query engines have many advantages, they cannot meet all of Baidu's analysis requirements. Baidu therefore decided to develop its own relational query engine on top of its internal distributed dataflow computing framework. The goal of the system is to provide reusable components for structured big data processing with high performance, high availability, and high extensibility. The system builds on ideas from classical compiler design and has four main layers:

The first layer is the frontend, which parses SQL into an AST and performs semantic analysis. The second layer is the intermediate representation (IR) layer, which implements an IR language for representing relational operations. The third layer is the pass framework, which abstracts optimization rules into passes that analyze and transform the logical plan. The fourth layer is the runtime, which compiles IR operators into physical operators.

The system is in production and covers 80% of big data batch processing at Baidu. On average, it is 30% faster than Hive.
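The four layers described above can be illustrated with a toy sketch. None of it is taken from the thesis: the SQL subset, the `SCHEMA` catalog, the `Scan`/`Filter`/`Project` operators, the single `remove_identity_project` pass, and the in-memory generator runtime are all hypothetical stand-ins, chosen only to show how a frontend, an IR, a pass framework, and a runtime might fit together. The real system targets a distributed dataflow framework rather than Python lists.

```python
import re
from dataclasses import dataclass

# Hypothetical catalog used for semantic analysis (not from the thesis).
SCHEMA = {"users": ("name", "age")}

# Layer 1: frontend. Parse a tiny SQL subset into an AST, then check names.
def parse(sql):
    m = re.fullmatch(
        r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(\w+)\s*>\s*(\d+))?\s*",
        sql, re.IGNORECASE)
    if m is None:
        raise ValueError("unsupported query")
    cols, table, wcol, wval = m.groups()
    ast = {"select": [c.strip() for c in cols.split(",")], "from": table}
    if wcol is not None:
        ast["where"] = (wcol, int(wval))
    return ast

def analyze(ast):
    if ast["from"] not in SCHEMA:
        raise ValueError("unknown table: " + ast["from"])
    known = SCHEMA[ast["from"]]
    used = list(ast["select"]) + ([ast["where"][0]] if "where" in ast else [])
    for col in used:
        if col not in known:
            raise ValueError("unknown column: " + col)
    return ast

# Layer 2: IR. Immutable relational operators forming a logical plan tree.
@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    bound: int

@dataclass(frozen=True)
class Project:
    child: object
    columns: tuple

def to_ir(ast):
    plan = Scan(ast["from"])
    if "where" in ast:
        plan = Filter(plan, *ast["where"])
    return Project(plan, tuple(ast["select"]))

# Layer 3: pass framework. Each pass is a function from plan to plan.
def remove_identity_project(node):
    # Drop a Project that sits directly on a Scan and keeps every column.
    if isinstance(node, Project):
        child = remove_identity_project(node.child)
        if isinstance(child, Scan) and node.columns == SCHEMA[child.table]:
            return child
        return Project(child, node.columns)
    if isinstance(node, Filter):
        return Filter(remove_identity_project(node.child), node.column, node.bound)
    return node

PASSES = [remove_identity_project]

def optimize(plan):
    for run_pass in PASSES:
        plan = run_pass(plan)
    return plan

# Layer 4: runtime. Compile IR operators into physical operators
# (here: zero-argument callables that return row iterators).
def compile_plan(node, tables):
    if isinstance(node, Scan):
        return lambda: iter(tables[node.table])
    if isinstance(node, Filter):
        child = compile_plan(node.child, tables)
        return lambda: (r for r in child() if r[node.column] > node.bound)
    if isinstance(node, Project):
        child = compile_plan(node.child, tables)
        return lambda: ({c: r[c] for c in node.columns} for r in child())
    raise TypeError("unknown IR node: %r" % (node,))

def run_query(sql, tables):
    plan = optimize(to_ir(analyze(parse(sql))))
    return list(compile_plan(plan, tables)())
```

Calling `run_query("SELECT name FROM users WHERE age > 30", tables)` with `tables` mapping `"users"` to a list of row dictionaries walks through all four layers in order. Keeping each pass as a plain plan-to-plan function means new optimization rules can be appended to `PASSES` without touching the frontend or the runtime, which is the kind of reuse the layered design aims for.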
Keywords/Search Tags: Query Engine, Big Data, Dataflow, Distributed Computing