
Research And Realization Of Optimization Technology For Big Data Cleansing

Posted on: 2017-04-08    Degree: Master    Type: Thesis
Country: China    Candidate: N N Li    Full Text: PDF
GTID: 2348330509457104    Subject: Computer technology
Abstract/Summary:
With the development of information technology, data now pervades every part of our lives, and the era of big data has made it more prominent than ever. As people eagerly mine value from data, data quality problems are increasingly exposed, such as redundant data, inconsistent data, erroneous data, and missing data. Data quality is therefore particularly important in the age of big data, and systems for cleaning massive data and managing data faults have emerged to meet this need. Existing information systems for massive data processing generally run on Hadoop, the most popular open-source parallel framework, but for various reasons they are often inefficient.

Data quality problems can be fatal to big data applications, so big data affected by them must be cleaned. The MapReduce programming framework can exploit parallelism to clean large data sets with high scalability. However, for lack of an effective design, MapReduce-based cleaning processes contain redundant computation, which degrades performance. The purpose of this thesis is therefore to optimize the parallel data cleaning process and improve its efficiency.

This thesis makes the following contributions. First, through our research we found that data cleaning tasks often run over the same input file or reuse the same intermediate results. Based on this observation, the thesis presents a new optimization technique: optimization based on task combination. By merging redundant computations, and by combining several simple computations over the same input file, we reduce the number of MapReduce rounds and hence the running time, ultimately optimizing the system. Second, the three-tier FLI architecture proposed in this thesis allows software systems to be analyzed from the perspective of system optimization; together with the task-merging optimization method, it forms an optimization theory for data cleaning that spans system analysis through detailed implementation. Finally, several complex modules of the data cleaning process have been optimized, namely the entity recognition module, the inconsistent data repair module, and the missing-value filling module. Experimental results show that the proposed strategies effectively improve the efficiency of data cleaning.
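To illustrate the task-combination idea, the following is a minimal sketch of how two simple cleaning tasks that scan the same input file (duplicate-record detection and per-column missing-value counting) could be merged into a single Hadoop MapReduce job, so the data is read in one round instead of two. The class names, the "DUP:"/"MISS:" key tags, and the comma-separated record layout are illustrative assumptions, not the implementation described in the thesis.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedCleaningJob {

    // One mapper serves both cleaning tasks by tagging its output keys,
    // so the input file is scanned only once.
    public static class CombinedMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", -1);

            // Task 1: duplicate detection, keyed on the whole record.
            ctx.write(new Text("DUP:" + line), ONE);

            // Task 2: count missing values per column.
            for (int i = 0; i < fields.length; i++) {
                if (fields[i].trim().isEmpty()) {
                    ctx.write(new Text("MISS:col" + i), ONE);
                }
            }
        }
    }

    // The reducer dispatches on the tag and sums counts for both tasks.
    public static class CombinedReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            // For the duplicate task, only report records seen more than once.
            if (key.toString().startsWith("DUP:") && sum < 2) {
                return;
            }
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combined-cleaning");
        job.setJarByClass(CombinedCleaningJob.class);
        job.setMapperClass(CombinedMapper.class);
        job.setReducerClass(CombinedReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run separately, these two tasks would each require a full MapReduce round over the same input; combined as above, one round produces both results, which is the kind of saving the task-combination optimization targets.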
Keywords/Search Tags: multi-task optimization, massive data, data cleaning, Hadoop, MapReduce