
The Research And Implementation Of Real-time Large-scale Heterogeneous Data Integration System

Posted on: 2017-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: M W Ye
Full Text: PDF
GTID: 2308330482481839
Subject: Software engineering
Abstract/Summary:
With the rapid development of big data analysis technologies, many industries have realized the significant business value of big data. For instance, many modern enterprises use big data analysis technologies to support their decision making. However, some decision-making systems may currently involve an unbounded number of data sources in a variety of formats, each holding a large amount of data. Analysts therefore wish to integrate data from multiple sources and access it through a unified interface.

In this paper, we present a just-in-time big data cleaning system set against the background of the healthcare industry. The system aims to merge data from a variety of sources and to provide a standard data access interface that serves cleaned data for analysis tasks and decision making. In traditional big data integration systems, it generally takes months to define the standard data access interface and to complete data cleaning, and the cleaning is usually performed as an offline preprocessing step before the data is made available for analysis. To overcome these limitations, our system adopts a novel analysis-aware data cleaning method. Specifically, instead of running as an offline batch process, the data cleaning step is performed according to analyst needs. Further, an automatic incremental mapping management platform is proposed to manage the schema mapping and data integration rules.

We present the overall design and the main functional modules of our just-in-time big data cleaning system. The system mainly consists of two parts: a real-time data crawling sub-system and an incremental mapping management platform. The former obtains the latest data by making use of a front-end processor machine. The latter manages the schema mapping and data integration rules. We also discuss the incremental heterogeneous data integration process and the corresponding optimizations.

Finally, we present extensive experiments that demonstrate the effectiveness and efficiency of the proposed system.
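
The abstract does not give implementation details, so the following is only a minimal Python sketch of the two ideas it names: schema mapping rules that can be registered incrementally, and cleaning that runs only for the fields an analyst actually requests. Every name here (`MappingRegistry`, `register`, `query`, `hospital_a`, the field names, and the two cleaning functions) is a hypothetical illustration, not the thesis's actual design.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical rule format: unified field -> (source field, cleaning function).
Rule = Tuple[str, Callable[[str], object]]


def clean_name(raw: str) -> str:
    """Collapse whitespace and normalize capitalization (illustrative rule)."""
    return " ".join(raw.split()).title()


def parse_weight_kg(raw: str) -> float:
    """Turn a string such as '70kg' into a number (illustrative rule)."""
    return float(raw.lower().replace("kg", "").strip())


class MappingRegistry:
    """Holds schema mapping rules per data source (assumed structure)."""

    def __init__(self) -> None:
        self._rules: Dict[str, Dict[str, Rule]] = {}

    def register(self, source: str, rules: Dict[str, Rule]) -> None:
        # Incremental: new rules are merged into the ones already known,
        # so a source can be mapped field by field over time.
        self._rules.setdefault(source, {}).update(rules)

    def query(
        self, source: str, records: Iterable[Dict[str, str]], fields: List[str]
    ) -> Iterator[Dict[str, object]]:
        # Analysis-aware: only the requested fields are mapped and cleaned;
        # dirty values in fields nobody asked for are never touched.
        rules = self._rules[source]
        for record in records:
            yield {
                field: rules[field][1](record[rules[field][0]])
                for field in fields
            }


registry = MappingRegistry()
registry.register("hospital_a", {"patient_name": ("pt_name", clean_name)})
registry.register("hospital_a", {"weight_kg": ("wt", parse_weight_kg)})

raw_records = [{"pt_name": "  alice   LIU ", "wt": "70kg", "dob": "unknown"}]
for row in registry.query("hospital_a", raw_records, ["weight_kg"]):
    print(row)
```

In this sketch the unmapped `dob` field is never cleaned because no query asks for it, which is the contrast with an offline batch step that would have to clean every field up front.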
Keywords/Search Tags: Big Data, Data Cleaning, Heterogeneous Data, Schema Mapping