Research And Implementation Of Data Hybrid Computing Platform Based On Spark

Posted on:2020-11-03

Degree:Master

Type:Thesis

Country:China

Candidate:S J Cao

Full Text:PDF

GTID:2428330575457077

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of computer technology, traditional industries have gradually transformed into digital enterprises, and the total volume of enterprise data resources has been increasing year by year. The value of data lies not only on its surface: processing and analytical techniques can also create new value from it. The National Health Institute operates a number of national information systems that produce data of large scale and great variety. Therefore, a big data hybrid computing platform is needed that supports multiple types of data sources and provides a comprehensive one-stop data computing service to meet the needs of medical research. At present, commercial big data platforms in the industry are expensive to use and difficult to deploy and maintain, and state-level enterprises with high data confidentiality requirements have many concerns about using commercial software. As for the data computing technology used in such platforms, data join is the operation most often used in data merging, multi-table joint analysis and similar tasks, but data skew, which disturbs the load balance of the computing units, has long been a bottleneck hindering performance improvement. Different scenarios require different query technologies, which imposes a steep learning curve across many tools on the user, and each query requires a manual judgment of the most suitable engine.

To address these shortcomings, this paper conducts in-depth research on big data computing platforms and their computing technology. The main research contents of this paper are as follows: 1) Design and implement a data join optimization strategy based on Spark, through in-depth study of the large-scale data join process and the factors influencing its performance. The strategy processes large-scale data efficiently, supports equi-join and theta-join, and performs stably on heavily skewed data; 2) Research and implement a hybrid query engine (HQE) that can satisfy multiple query requirements at the same time. It is implemented by splitting the Spark SQL and Apache Kylin modules and then reconstructing them with a unified query parsing module and a routing strategy; 3) Research and implement a data hybrid computing platform on Spark that integrates multiple kinds of data computation, building on the results of the first two studies. The platform mainly includes four modules: data management, data processing, data query and data factory.

This paper is based on the actual application scenarios of the National Health Institute. Following an analysis of functional and technical requirements, the data processing operations commonly used in medical research are templated and added to the data processing module, and drag-and-drop front-end pages are designed to facilitate use by researchers. The Spark-based hybrid data computing platform presented in this paper provides a one-stop data service for the whole life cycle of medical research, from data management to data processing, query, analysis, and visual display of calculation results. At present, the platform has been deployed at the National Health Institute and is used for daily scientific research.
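The abstract does not describe the join optimization strategy itself. As a rough illustration of why data skew hurts load balance and how one well-known mitigation works, the following minimal sketch simulates key salting for an equi-join in plain Python, without Spark. Salting is a generic technique chosen here for illustration and is not claimed to be the strategy of this thesis; it also does not cover theta-join. All names (`partition_of`, `salted_join`, `NUM_PARTITIONS`, `SALT_BUCKETS`) and the sample data are hypothetical.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical number of reduce partitions
SALT_BUCKETS = 8    # hypothetical number of salts per hot key


def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def salted_join(left, right, hot_keys, rng):
    """Equi-join two lists of (key, value) pairs, salting the hot keys.

    Left rows with a hot key receive a random salt; right rows with a hot
    key are replicated once per salt so every salted left row still finds
    its match. Returns (joined rows, left-row count per partition).
    """
    # Build side: replicate hot-key rows across all salts.
    build = {}
    for key, value in right:
        salts = range(SALT_BUCKETS) if key in hot_keys else [0]
        for salt in salts:
            build.setdefault(f"{key}#{salt}", []).append(value)

    joined = []
    load = Counter()
    # Probe side: each left row is routed by its salted key.
    for key, value in left:
        salt = rng.randrange(SALT_BUCKETS) if key in hot_keys else 0
        salted_key = f"{key}#{salt}"
        load[partition_of(salted_key)] += 1
        for other in build.get(salted_key, []):
            joined.append((key, value, other))
    return joined, load


# Hypothetical skewed input: one key accounts for most of the left table.
left_rows = [("hot", i) for i in range(1000)] + [(f"k{i}", i) for i in range(50)]
right_rows = [("hot", "H")] + [(f"k{i}", f"v{i}") for i in range(50)]

_, plain_load = salted_join(left_rows, right_rows, set(), random.Random(0))
_, salted_load = salted_join(left_rows, right_rows, {"hot"}, random.Random(0))
print("max partition load without salting:", max(plain_load.values()))
print("max partition load with salting:", max(salted_load.values()))
```

Passing an empty `hot_keys` set reproduces an ordinary hash-partitioned join, so the two calls can be compared directly: the join result is the same, while the rows of the hot key are spread over several partitions instead of one. The cost is that the matching rows on the other side are replicated once per salt.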

Keywords/Search Tags:

Spark, big data platform, data query engine, data join

Related items

1	Self-Service Data Extraction System For Big Data Platform
2	Optimizing Big Data Equi-join In Spark And Its Application In Analysis Of Network Traffic Data
3	LAV In Data Integration System Query Processing
4	Research On Optimizing Top-k Join Queries Based On Spark
5	On The Low Overhead Configuration Optimization Of In-memory Big Data Query Engine
6	Multi-Join Query Algorithm Research Over Data Streams
7	Research And Implementation Of Cross-platform Unified Big Data Intelligent SQL Query System
8	Research Of Federated Query Method For Linked Data Based On Semi-join
9	An Ad-hoc Query Engine Based On Spark SQL
10	Research And Implementation Of Multi-Way Join Query Processing Algorithms Over Big Spatial Data In Cloud Environment