With the rapid development of computer technology, traditional industries have gradually transformed into digital enterprises, and the total volume of enterprise data resources has been increasing year by year. The value of data lies not only on its surface: processing and analytical techniques can also create new value from it. The National Health Institute operates a number of national information systems that produce data of large scale and variety. Therefore, a big data hybrid computing platform is needed that supports multiple types of data sources and provides a comprehensive one-stop data computing service to meet the needs of medical research. At present, the commercial big data platforms available in the industry are expensive to use and difficult to deploy and maintain, and state-level institutions with high data confidentiality requirements have many concerns about using commercial software. With respect to the data computing technology used in such platforms, data join is widely used in data merging, multi-table joint analysis and other operations, but data skew, which upsets the load balance of the computing units, has been a bottleneck hindering performance improvement. In addition, different scenarios require different query technologies, which imposes a steep learning curve across many tools on the user, and each query requires a manual judgment of the most suitable engine.

To address these shortcomings, this paper conducts in-depth research on the big data computing platform and its computing technology. The main research contents are as follows: 1) Design and implement a data join optimization strategy based on Spark, through an in-depth study of the large-scale data join process and the factors that influence its performance. The strategy processes large-scale data efficiently, supports equi-join and theta-join, and keeps performance stable on heavily skewed data; 2) Research and implement a hybrid query engine (HQE) that can satisfy multiple query requirements at the same time. It is implemented by decomposing Spark SQL and Apache Kylin into modules and recombining them behind a unified query parsing module and a routing strategy; 3) Research and implement a hybrid data computing platform on Spark that integrates multiple kinds of data computation, building on the results of the first two studies. The platform mainly includes four modules: data management, data processing, data query and data factory.

This paper is based on the actual application scenarios of the National Health Institute. Starting from an analysis of functional and technical requirements, the data processing operations commonly used in medical research are turned into templates and added to the data processing module, and drag-and-drop front-end pages are designed to facilitate use by researchers. The Spark-based hybrid data computing platform presented in this paper provides a one-stop data service for the whole life cycle of medical research, from data management to data processing, query, analysis, and visual display of calculation results. At present, the platform has been deployed at the National Health Institute and is used for daily scientific research.
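
To make the data skew problem concrete, the following is a minimal sketch of key salting, a common general technique for balancing a skewed equi-join. It is not the join optimization strategy proposed in this paper, which is not detailed here, and it does not cover theta-join. The function name `salted_equi_join`, its parameters, and the in-memory lists standing in for distributed partitions are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def salted_equi_join(left, right, num_salts, seed=0):
    """Hypothetical helper: equi-join two lists of (key, value) pairs with key salting.

    `left` is assumed to be the skewed side. Returns the joined rows and the
    number of left rows that landed in each salted bucket.
    """
    rng = random.Random(seed)

    # Skewed side: attach a random salt so that one hot key is spread
    # over up to `num_salts` buckets instead of a single one.
    left_buckets = defaultdict(list)
    for key, value in left:
        left_buckets[(key, rng.randrange(num_salts))].append(value)

    # Other side: replicate every row once per salt value so that each
    # salted bucket of the skewed side can still find its matches.
    right_buckets = defaultdict(list)
    for key, value in right:
        for salt in range(num_salts):
            right_buckets[(key, salt)].append(value)

    # Each salted bucket is joined independently; in a distributed engine
    # a bucket would correspond to the work of one task.
    joined = []
    for (key, salt), left_values in left_buckets.items():
        for left_value in left_values:
            for right_value in right_buckets.get((key, salt), []):
                joined.append((key, left_value, right_value))

    bucket_sizes = {bucket: len(values) for bucket, values in left_buckets.items()}
    return joined, bucket_sizes
```

In this sketch the join result is the same as that of an unsalted equi-join, while the rows of a hot key are divided among several buckets. The cost is that the non-skewed side is replicated `num_salts` times, so the salt count trades load balance against extra data volume.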