
Research And Realization Of Optimization Technology For Big Data Cleansing

Posted on: 2017-04-08    Degree: Master    Type: Thesis
Country: China    Candidate: N N Li    Full Text: PDF
GTID: 2348330509457104    Subject: Computer technology
Abstract/Summary:
With the development of information technology, data now pervades every part of our lives, and the era of big data has made it more prominent than ever. As people eagerly mine value from data, data quality problems are increasingly exposed, such as redundant data, inconsistent data, erroneous data, and missing data. Data quality is therefore particularly important in the age of big data, and systems for cleaning massive data and managing data faults have emerged to meet this need. Existing information systems for massive data processing generally run on Hadoop, the most popular open-source parallel framework, but for various reasons they are often inefficient.

Data quality problems can be fatal to big data applications, so big data affected by them must be cleaned. The MapReduce programming framework can exploit parallelism to clean large data sets with high scalability. However, for lack of an effective design, MapReduce-based cleaning processes contain redundant computation, which degrades performance. The purpose of this thesis is therefore to optimize the parallel data cleaning process and improve its efficiency.

This thesis makes the following contributions. First, through our research we found that data cleaning tasks often run over the same input file or reuse the same intermediate results. Based on this observation, the thesis presents a new optimization technique: optimization based on task combination. By merging redundant computations, and by combining several simple computations over the same input file, we reduce the number of MapReduce rounds and hence the running time, ultimately optimizing the system. Second, the three-tier FLI architecture proposed in this thesis allows software systems to be analyzed from the perspective of system optimization; together with the task-merging optimization method, it forms an optimization theory for data cleaning that spans system analysis through detailed implementation. Finally, several complex modules of the data cleaning process have been optimized, namely the entity recognition module, the inconsistent data repair module, and the missing-value filling module. Experimental results show that the proposed strategies effectively improve the efficiency of data cleaning.
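To illustrate the task-combination idea, the following is a minimal sketch of how two simple cleaning tasks that scan the same input file (duplicate-record detection and per-column missing-value counting) could be merged into a single Hadoop MapReduce job, so the data is read in one round instead of two. The class names, the "DUP:"/"MISS:" key tags, and the comma-separated record layout are illustrative assumptions, not the implementation described in the thesis.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedCleaningJob {

    // One mapper serves both cleaning tasks by tagging its output keys,
    // so the input file is scanned only once.
    public static class CombinedMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", -1);

            // Task 1: duplicate detection, keyed on the whole record.
            ctx.write(new Text("DUP:" + line), ONE);

            // Task 2: count missing values per column.
            for (int i = 0; i < fields.length; i++) {
                if (fields[i].trim().isEmpty()) {
                    ctx.write(new Text("MISS:col" + i), ONE);
                }
            }
        }
    }

    // The reducer dispatches on the tag and sums counts for both tasks.
    public static class CombinedReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            // For the duplicate task, only report records seen more than once.
            if (key.toString().startsWith("DUP:") && sum < 2) {
                return;
            }
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combined-cleaning");
        job.setJarByClass(CombinedCleaningJob.class);
        job.setMapperClass(CombinedMapper.class);
        job.setReducerClass(CombinedReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run separately, these two tasks would each require a full MapReduce round over the same input; combined as above, one round produces both results, which is the kind of saving the task-combination optimization targets.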
Keywords/Search Tags: multi-task optimization, massive data, data cleaning, Hadoop, MapReduce