
Design And Implementation Of Data Preparation Platform For Multi-source Big Data

Posted on: 2021-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: B B Chen
Full Text: PDF
GTID: 2518306308467864
Subject: Computer Science and Technology
Abstract/Summary:
As enterprise business grows, massive volumes of heterogeneous structured and unstructured data continue to accumulate. Data acquisition methods and data processing technologies keep improving, so even an information system that was well designed and planned at the outset cannot guarantee that its stored data will still meet analysts' quality requirements as time passes and the business changes. An enterprise typically covers many lines of business, each generating data with its own structure, so data structures differ across businesses and storage methods are diverse. The result is big data drawn from multiple sources, referred to here as multi-source big data. Before running an analysis task such as data mining on multi-source big data, an enterprise must first preprocess the data; this is the data preparation process. Data preparation is time-consuming and labor-intensive, and it often requires analysts to be able to write code. Existing data preparation processes are difficult to re-edit, and similar processes are difficult to reuse. These problems significantly increase the workload of data analysts.

This thesis designs and implements a data preparation platform for multi-source big data. The platform gives users tools to access different big data sources, build data preparation processes, and display and save the preparation results. Instead of coding or cleaning data by hand, users work in a visual interface in which the cleaning work is expressed as a user-editable set of data preparation steps, and they customize the process there to reduce coding and manual operation. Users can also edit and execute a process by creating a process file. In addition, a missing data processing algorithm based on GBDT (Gradient Boosting Decision Tree) is proposed to improve the accuracy of filling missing values in time series data.

The thesis first introduces the concepts and background of multi-source big data and data preparation, explains the background and significance of a data preparation platform for multi-source big data, and briefly introduces the technologies used in implementing the platform. Based on the key application scenarios, it then analyzes the demands on the platform to clarify the system's functional requirements. Next, it analyzes the key problems that must be solved during implementation and proposes a solution to each. First, to solve the problem of accessing multi-source big data, a unified view of the data is established: a hybrid ontology-based XML (Extensible Markup Language) method maps data sources to views, and the data source for a data preparation process is obtained through this mapping. Second, to solve the problem of mapping between the user view and the data preparation process model, a component structure model is established; process components defined on the MVC (Model-View-Controller) structure, together with data preparation process mapping and modeling, yield the process documents and models. Third, to improve the accuracy of time series missing data processing during data preparation, a GBDT-based missing data processing step is proposed, which combines GBDT regression prediction filling with statistical value filling.

Building on these solutions, a prototype of the data preparation platform for multi-source big data was designed and implemented, and its functions were tested. Finally, the thesis summarizes the work and points out its shortcomings and directions for future improvement.
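
The abstract does not give the details of the GBDT-based filling step, so the following is only a minimal sketch of the general idea of combining regression prediction filling with statistical value filling. Everything in it is an assumption made for illustration: the function names (`fit_gbdt`, `predict`, `fill_missing`), the use of lagged values as features, depth-1 trees (stumps) as base learners, and the rule that the mean of the observed values is used whenever a regression prediction is not possible. It is not the thesis's algorithm.

```python
# Illustrative sketch only: gradient boosting with decision stumps, used to
# fill gaps in a time series from its previous values, with a mean-value
# fallback. All names and design choices here are hypothetical.

def _avg(values):
    return sum(values) / len(values)

def _fit_stump(rows, residuals):
    """Return (feature, threshold, left_value, right_value) of the single
    split that minimizes squared error, or None if no split exists."""
    best = None
    for j in range(len(rows[0])):
        distinct = sorted(set(row[j] for row in rows))
        for lo, hi in zip(distinct, distinct[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(rows, residuals) if row[j] <= thr]
            right = [r for row, r in zip(rows, residuals) if row[j] > thr]
            lm, rm = _avg(left), _avg(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return None if best is None else best[1:]

def fit_gbdt(rows, targets, n_rounds=50, lr=0.3):
    """Fit boosted stumps for squared-error regression."""
    base = _avg(targets)
    preds = [base] * len(targets)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(targets, preds)]
        stump = _fit_stump(rows, residuals)
        if stump is None:
            break
        j, thr, lm, rm = stump
        stumps.append(stump)
        preds = [p + lr * (lm if row[j] <= thr else rm)
                 for row, p in zip(rows, preds)]
    return base, stumps, lr

def predict(model, row):
    base, stumps, lr = model
    return base + sum(lr * (lm if row[j] <= thr else rm)
                      for j, thr, lm, rm in stumps)

def fill_missing(series, n_lags=2, min_train=5):
    """Fill None entries: GBDT prediction from the previous n_lags values
    when possible, otherwise the mean of the observed values."""
    observed = [v for v in series if v is not None]
    stat_fill = _avg(observed)  # statistical value filling (assumed: mean)

    # Training rows use only fully observed windows.
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        window = series[t - n_lags:t + 1]
        if all(v is not None for v in window):
            rows.append([series[t - k] for k in range(1, n_lags + 1)])
            targets.append(series[t])
    model = fit_gbdt(rows, targets) if len(rows) >= min_train else None

    filled = list(series)
    for t, value in enumerate(filled):
        if value is not None:
            continue
        if model is not None and t >= n_lags:
            # Earlier gaps are already filled, so the lags are available.
            lags = [filled[t - k] for k in range(1, n_lags + 1)]
            filled[t] = predict(model, lags)
        else:
            filled[t] = stat_fill
    return filled
```

Calling `fill_missing` on a list of floats in which gaps are marked with `None` returns a new list of the same length. Observed values are left unchanged, and a gap at the very start of the series, where no lagged values exist, receives the mean of the observed values. A production system would more likely rely on an established GBDT library and richer features than raw lags.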
Keywords/Search Tags: multi-source big data, data preparation, missing data processing, GBDT regression analysis