Self-Service Data Extraction System For Big Data Platform

Posted on:2020-05-15

Degree:Master

Type:Thesis

Country:China

Candidate:L X Huang

Full Text:PDF

GTID:2428330572473609

Subject:Computer Science and Technology

Abstract/Summary:

Internet data is growing rapidly and will continue to grow, which has made the extraction and analysis of large-scale data a pressing issue for enterprises. Without self-service data extraction tools, time and labor costs become a limiting factor in business expansion, so designing an efficient self-service data extraction system is critical to enterprise development. This thesis adopts Hive as its data warehouse solution. In the parallel processing of massive data, however, the network transmission cost incurred by join operations becomes a performance bottleneck. Improving the efficiency of join queries in Hive therefore plays an important role in improving the performance of a self-service data extraction system for a big data platform. This thesis proposes a method for improving the efficiency of join queries in Hive, namely the learn-to-query architecture. Users only need to configure their request through a visual interface, and the learn-to-query architecture generates the best query plan. The main research contents and results of this thesis are as follows:

1) A query cost prediction model is proposed to predict the execution time of queries in Hive. Its prediction serves as the metric for selecting the optimal query plan and for adjusting long-running query tasks in time. The model uses LSTM, a deep learning technique, to predict query cost. Building on previous work, an improved query cost prediction model is designed that is better suited to Hive queries in a big data environment, and its effectiveness is verified through experimental analysis and comparison.

2) Within the learn-to-query architecture, a graph-based SQL generation model is proposed that transforms user configuration data into Hive-based query plans and is combined with the query cost prediction model to pick out the optimal query plan. Finally, it is verified that the proposed learn-to-query architecture can significantly improve the efficiency of join queries in Hive.

3) Based on the learn-to-query architecture, a complete self-service data extraction system is designed and implemented. The system automatically generates scheduled data extraction tasks from the user's configuration data. It also provides data permission management, log monitoring, task management, audit management, and temporary table management, so that data extraction tasks are managed automatically.
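
The plan-selection idea described in the abstract (generate candidate join plans from a join graph, score each with a cost predictor, keep the cheapest) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the table names, row counts, join keys, and function names are hypothetical, and `predicted_cost` is a crude row-count heuristic standing in for the LSTM model, whose features and structure the abstract does not specify.

```python
from itertools import permutations

# Hypothetical user configuration: estimated row counts and join conditions.
ROWS = {"orders": 1_000_000, "users": 50_000, "regions": 100}
JOINS = {
    frozenset(("orders", "users")): "orders.user_id = users.id",
    frozenset(("users", "regions")): "users.region_id = regions.id",
}


def candidate_orders(tables):
    """Yield join orders in which each table joins one that is already present."""
    for order in permutations(tables):
        if all(
            any(frozenset((order[i], prev)) in JOINS for prev in order[:i])
            for i in range(1, len(order))
        ):
            yield order


def to_sql(order):
    """Render one join order as a Hive-style SELECT statement."""
    sql = f"SELECT * FROM {order[0]}"
    for i in range(1, len(order)):
        conds = [
            JOINS[frozenset((order[i], prev))]
            for prev in order[:i]
            if frozenset((order[i], prev)) in JOINS
        ]
        sql += f" JOIN {order[i]} ON " + " AND ".join(conds)
    return sql


def predicted_cost(order):
    """Stand-in for a learned cost model (NOT the thesis's LSTM).

    Each join is charged the rows read from both inputs, and the join
    output is assumed to be as large as its larger input.
    """
    cost, intermediate = 0, ROWS[order[0]]
    for table in order[1:]:
        cost += intermediate + ROWS[table]
        intermediate = max(intermediate, ROWS[table])
    return cost


def choose_plan(tables):
    """Return the candidate join order with the lowest predicted cost."""
    return min(candidate_orders(tables), key=predicted_cost)


if __name__ == "__main__":
    best = choose_plan(ROWS)
    print(to_sql(best))
    print("predicted cost:", predicted_cost(best))
```

Under this toy heuristic, plans that join the two small tables first and bring in the largest table last score lowest. A learned predictor would replace `predicted_cost` while leaving the enumerate-then-select loop unchanged.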

Keywords/Search Tags:

big data, join query, SQL generation engine, query cost prediction, LSTM

Related items

1	Multi-Join Query Algorithm Research Over Data Streams
2	LAV In Data Integration System Query Processing
3	Research On Dynamic Programming Based Join Tree Generation Algorithms
4	Research On Data Query Optimization Algorithm Of Distributed Database
5	Hadoop-based Geospatial Data Storage And Query Technology
6	Research Of Query Optimization Based On Join Index
7	Application And Research On Multi-join Query Optimization Of Database Based On GA
8	Fuzzing Methods For Query Processing Functionality Of Analytical Databases
9	Research On Spatial Join And Variants Of Nearest Neighbor Query
10	Join Processing And Optimizing On Large Clusters