Research And Implementation Of OAA Data Engine Based On The Model Of Aggregation On Supply And Demand

Posted on: 2021-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: H B Xiao
GTID: 2428330611465587
Subject: Computer technology
Abstract/Summary:
With the advent of the big data era, managing data efficiently has become increasingly important. The difficulty of data management lies, on the one hand, in the rapid growth of data size: the amount of data a system must handle far exceeds the capacity of a single machine. On the other hand, data formats differ among business departments, data sources are heterogeneous, and data cannot be shared, so integrating and managing these data is difficult. In addition, to ensure high availability, multiple copies of the same data are often stored on multiple servers to prevent data loss. In this case, effectively ensuring data consistency between the source server and the backup server is also a part of data management that cannot be ignored. Starting from this research background, this paper focuses on two functional modules, file incremental synchronization and heterogeneous data integration, and proposes the OAA (Object Access Agent) data engine based on the Model of Aggregation on Supply and Demand (abbreviated as MASD). The work covers the following three aspects:

(1) Implementation of a CDC-based (Content-Defined Chunking) file incremental synchronization method. Based on an analysis of the flow and principle of the Rsync algorithm, and in view of its excessive consumption of computing resources, this paper proposes a general CDC-based method for file incremental synchronization. The method exploits the CDC algorithm's strong resistance to byte shifting and greatly reduces computational elasticity, so that computing resource consumption does not rise rapidly as the incremental data size grows. In a high-speed network environment, compared with the Rsync algorithm, this method consumes fewer computing resources, has lower computational elasticity, and is more practical.

(2) Implementation of the heterogeneous data integration function. Driven by a core configuration file and built on the Spark distributed computing framework, this paper divides heterogeneous data integration into five modules along the processing flow: data extraction, data join, data conversion, data injection, and timed synchronization. In addition, we have optimized some unfriendly designs in the Spark framework, enabling data conversion through dynamically registered UDFs and data injection based on distributed concurrent programming. This further extends the native functions of Spark and makes the OAA data engine more practical and flexible in the field of heterogeneous data integration.

(3) Application and functional extension of MASD. This model organizes communication between services according to a "supply and demand relationship" and supports the dynamic addition of services and function expansion. This paper encapsulates the file incremental synchronization function and the heterogeneous data integration function and mounts them to the model in the form of PIPs (Plug-in Processors), forming the OAA data engine, which allows other users to call the related services through this engine. In addition, to address MASD's lack of data processing functions, we add a data caching function based on the message system to improve the operability of data within MASD.
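
To make the idea in aspect (1) concrete, the following is a minimal sketch of content-defined chunking applied to incremental synchronization. It is not the thesis's implementation: the abstract does not give the hash function, chunk-size parameters, or wire format, so the Gear-style rolling hash, the constants `MASK`, `MIN_SIZE`, and `MAX_SIZE`, the SHA-256 chunk digests, and the function names `chunk`, `delta`, and `apply_delta` are all assumptions chosen for illustration.

```python
import hashlib
import random
from typing import List, Tuple

# Hypothetical parameters; the abstract does not specify any of these.
_rng = random.Random(0)                      # fixed seed so both sides agree
GEAR = [_rng.getrandbits(32) for _ in range(256)]
MASK = (1 << 6) - 1                          # boundary when low 6 bits are zero
MIN_SIZE = 16                                # smallest chunk allowed
MAX_SIZE = 256                               # forced cut if no boundary found


def chunk(data: bytes) -> List[bytes]:
    """Split data at positions chosen by a rolling hash of the content."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Gear-style update: older bytes are shifted out of the 32-bit hash.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks


def digest(c: bytes) -> str:
    return hashlib.sha256(c).hexdigest()


def delta(old: bytes, new: bytes) -> List[Tuple[str, object]]:
    """Describe `new` as references to chunks of `old` plus literal data."""
    known = {digest(c) for c in chunk(old)}
    ops = []
    for c in chunk(new):
        d = digest(c)
        ops.append(("ref", d) if d in known else ("data", c))
    return ops


def apply_delta(old: bytes, ops: List[Tuple[str, object]]) -> bytes:
    """Rebuild the new file on the side that already holds `old`."""
    store = {digest(c): c for c in chunk(old)}
    return b"".join(store[x] if kind == "ref" else x for kind, x in ops)
```

Because chunk boundaries depend on the bytes themselves and not on fixed offsets, an insertion near the start of a file tends to disturb only nearby chunks, and later chunks can still be sent as short references. Each side also hashes its file once per chunk, which is the property the abstract contrasts with Rsync's computing cost.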
Keywords/Search Tags:MASD, File Incremental Synchronization, Heterogeneous Data Integration, Content-Defined-Chunking, Big Data Computing