As the big data industry grows, the volume of data being referenced and analysed keeps increasing, and so do the reliance on and requirements for big data components, placing new demands on data computation and storage. Data warehouses are being replaced by data lake technology because they tightly couple computation and storage and cannot store or query unstructured files, while data lakes that lack established data governance processes can easily turn into data swamps. The lakehouse (integrated lake and warehouse) is a new architecture that combines the advantages of both. At present there is no unified standard for the lakehouse; the major cloud service vendors are actively exploring it and have each formed their own commercial lakehouse architecture. At the open source level, no integrated lakehouse architecture has yet emerged, and the exploration mainly faces the following problems:
1) The data lake comprises many components with compatibility problems between them, and there is no complete release process for making the data lake cloud native. There are no constraints on data entering the lake, so it easily becomes a data swamp.
2) Multiple service systems undermine the simplicity and ease of use of the data lake platform. The lack of dynamic balance between the tasks that build a warehouse on the lake and the available computing resources leads to low data processing parallelism.
3) Interacting with the data lake through the command line is complex and cumbersome, which raises both the learning cost of the platform and the cost of producing data.

To address these issues and challenges, this paper focuses on three topics: building data lakes from cloud native resources, unifying the data entry point and managing ETL task scheduling, and designing and implementing an integrated cloud native lakehouse big data storage system. The work is divided into the following three main parts:
1) Design and implement a method for building data lakes from cloud native resources. Relying on the cloud native environment, the data lake components are containerised and the containers are orchestrated in a pipeline fashion to make the data lake cloud native, solving the compatibility and deployment problems. A storage-compute separation scheme improves the portability and flexibility of the data lake. A complete data writing system is developed to keep the data lake's metadata and data assets standardised.
2) Design and implement a unified data entry point and ETL task scheduling management. A unified entry point is established for multiple computing engines, simplifying the system's service architecture. ETL tasks are refined into computational jobs so that parallel jobs have equal access to cloud native computing resources. On the layered model for building the data lake, snapshot timing relationships are defined to improve the ability to correct data errors.
3) Design and implement an integrated cloud native lakehouse big data storage system. Container and container orchestration technology are used to make the data lake cloud native, and the system's modular components establish a process for building a warehouse on the lake, ultimately producing clean, well-organised data in the data lake that serves as a data source for BI analysis and business decisions.

Finally, this paper realises an integrated cloud native lakehouse big data storage system, which provides developers with a full-link solution for building data lake resources, collecting data into the lake, and building warehouses on the lake. The system is applied in the national key R&D project "Research and development of technology consulting technology and service platform based on big data", which verifies the effectiveness and practicality of the research in this paper.
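
The second element above (refining ETL tasks into computational jobs that share computing resources equally) can be illustrated with a minimal sketch. It assumes one job per input partition and an even split of CPU cores; the names `EtlTask`, `to_jobs`, and `fair_share` are hypothetical and are not taken from the system described here, whose actual scheduler may work differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EtlTask:
    """A hypothetical ETL task that is refined into smaller computational jobs."""
    name: str
    partitions: List[str]

    def to_jobs(self) -> List[str]:
        # Assumption: one job per input partition; the job id format is illustrative.
        return [f"{self.name}/{p}" for p in self.partitions]


def fair_share(jobs: List[str], total_cores: int) -> Dict[str, int]:
    """Split total_cores as evenly as possible across the parallel jobs."""
    if not jobs:
        return {}
    base, extra = divmod(total_cores, len(jobs))
    # The first `extra` jobs receive one additional core so that no core is left idle.
    return {job: base + (1 if i < extra else 0) for i, job in enumerate(jobs)}


if __name__ == "__main__":
    task = EtlTask("orders_to_dwd", ["2024-01", "2024-02", "2024-03"])
    print(fair_share(task.to_jobs(), total_cores=8))
```

With three jobs and eight cores, two jobs receive three cores and one receives two, so the shares differ by at most one core. In a real cloud native deployment the allocation would more plausibly be expressed through the orchestrator's own resource requests and quotas than computed by hand.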