Self-Service Data Extraction System For Big Data Platform

Posted on: 2020-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: L X Huang
Full Text: PDF
GTID: 2428330572473609
Subject: Computer Science and Technology
Abstract/Summary:
Internet data is growing rapidly and will continue to grow, which has made the extraction and analysis of large-scale data a pressing issue for enterprises. Without self-service data extraction tools, time and labor costs become a limiting factor in business expansion, so designing an efficient self-service data extraction system is critical to enterprise development. This thesis adopts Hive as the data warehouse solution. In the parallel processing of massive data, however, the network transmission cost produced by join operations becomes a performance bottleneck. Improving the efficiency of join queries in Hive is therefore important for improving the performance of a self-service data extraction system on a big data platform. This thesis proposes a method for improving join query efficiency in Hive, namely the learn-to-query architecture. Users only need to configure the extraction on a visual interface, and the learn-to-query architecture generates the best query plan. The main research contents and results are as follows:

1) A query cost prediction model is proposed to predict the execution time of queries in Hive. Its prediction serves as a metric for selecting the optimal query plan and for adjusting long-running query tasks in time. The model uses LSTM, a deep learning technique, to predict query cost. Building on previous work, an improved query cost prediction model is designed that is better suited to Hive queries in a big data environment, and its effectiveness is verified through experimental analysis and comparison.

2) Within the learn-to-query architecture, a graph-based SQL generation model is proposed that transforms user configuration data into Hive-based query plans and is combined with the query cost prediction model to pick the optimal plan. It is then verified that the learn-to-query architecture significantly improves the efficiency of join queries in Hive.

3) Based on the learn-to-query architecture, a complete self-service data extraction system is designed and implemented. The system automatically generates a scheduled data extraction task from the user's configuration data. It also provides data rights management, log monitoring, task management, audit management, and temporary table management, enabling automatic management of data extraction tasks.
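The interplay between graph-based SQL generation and cost-based plan selection can be illustrated with a minimal sketch. It is not the thesis's implementation: the table names, join conditions, row counts, and every function name below are hypothetical, and `predicted_cost` is a crude stand-in heuristic where the thesis uses an LSTM-based model. The sketch only shows the general shape of the idea: enumerate join orders that are valid on a join graph, render each as a query, and keep the one with the lowest predicted cost.

```python
from itertools import permutations

# Hypothetical join graph: each edge carries the join condition for two tables.
JOIN_EDGES = {
    frozenset({"orders", "users"}): "orders.user_id = users.id",
    frozenset({"orders", "items"}): "orders.item_id = items.id",
}
# Hypothetical row counts, used only by the stand-in cost function below.
ROW_COUNTS = {"orders": 1_000_000, "users": 50_000, "items": 2_000}


def candidate_orders(tables):
    """Yield join orders in which every table connects to an earlier one."""
    for order in permutations(tables):
        seen = {order[0]}
        valid = True
        for table in order[1:]:
            if not any(frozenset({table, s}) in JOIN_EDGES for s in seen):
                valid = False
                break
            seen.add(table)
        if valid:
            yield order


def to_sql(order, columns):
    """Render one join order as a HiveQL-style SELECT statement."""
    sql = f"SELECT {', '.join(columns)} FROM {order[0]}"
    seen = [order[0]]
    for table in order[1:]:
        conds = [
            JOIN_EDGES[frozenset({table, s})]
            for s in seen
            if frozenset({table, s}) in JOIN_EDGES
        ]
        sql += f" JOIN {table} ON {' AND '.join(conds)}"
        seen.append(table)
    return sql


def predicted_cost(order):
    """Stand-in for a learned cost model (assumption, not an LSTM).

    Assumes each join shrinks the intermediate result to the smaller
    side, and sums the intermediate sizes.
    """
    cost, size = 0, ROW_COUNTS[order[0]]
    for table in order[1:]:
        size = min(size, ROW_COUNTS[table])
        cost += size
    return cost


def best_plan(tables, columns):
    """Return the candidate query with the lowest predicted cost."""
    order = min(candidate_orders(tables), key=predicted_cost)
    return to_sql(order, columns), predicted_cost(order)


if __name__ == "__main__":
    sql, cost = best_plan(["orders", "users", "items"], ["users.name", "items.title"])
    print(cost)
    print(sql)
```

In a real system the heuristic would be replaced by the trained predictor, and the candidate set would likely be pruned rather than enumerated exhaustively, since the number of permutations grows factorially with the number of tables.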
Keywords/Search Tags: big data, join query, SQL generation engine, query cost prediction, LSTM