
Research on a Data Cleaning Algorithm Based on MapReduce

Posted on: 2017-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: F Chen
Full Text: PDF
GTID: 2308330488450128
Subject: Systems analysis and integration
Abstract/Summary:
With the rapid development and wide application of the Internet and other information technologies, the amount of data generated in all areas keeps increasing. Extracting value from vast amounts of data through data mining has become a hot topic, and keeping data quality at a high level is a crucial step in data mining. In the progression from data to information, from information to knowledge, and from knowledge to intelligence, data cleaning technology provides the clean, accurate, and concise data on which this value depends.

Data cleaning occupies a very important position in the data mining process and is the key technology for obtaining high-quality data. Traditional data cleaning technology is aimed at structured data and small-scale processing and analysis; it handles data from heterogeneous sources poorly, producing low-quality analytical results, so it is not suitable for massive amounts of data, much of which is now unstructured. Researchers therefore need to summarize existing cleaning models, innovate data processing techniques, and combine them with new theoretical models so that data cleaning technology keeps pace with the characteristics of the big data era.

Correct decisions require reliable, accurate data that truly reflects the actual situation. High-quality data is a prerequisite for data analysis, and data cleaning can detect errors and improve quality when data is stored or migrated. Rough set theory, proposed by the Polish scholar Z. Pawlak, is a practical theory for analyzing vague, imprecise, and uncertain knowledge; one of its major purposes is to find practical decision rules in massive data and thereby acquire new knowledge.
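The abstract does not give code, but Pawlak's rough set ideas it invokes can be sketched briefly. The following minimal Python example (the attribute names and toy table are illustrative assumptions, not from the thesis) computes the lower and upper approximations of a target concept from the indiscernibility classes of a decision table:

```python
from collections import defaultdict

def indiscernibility_classes(table, attrs):
    """Group row indices that are indistinguishable on the chosen attributes."""
    classes = defaultdict(set)
    for i, row in enumerate(table):
        key = tuple(row[a] for a in attrs)
        classes[key].add(i)
    return list(classes.values())

def approximations(table, attrs, target):
    """Pawlak lower/upper approximations of a target set of row indices."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(table, attrs):
        if block <= target:   # block wholly inside the concept -> certainly in it
            lower |= block
        if block & target:    # block overlaps the concept -> possibly in it
            upper |= block
    return lower, upper

# Toy decision table: each row is {attribute: value} (hypothetical data)
rows = [
    {"age": "young", "income": "low",  "buys": "no"},
    {"age": "young", "income": "low",  "buys": "yes"},
    {"age": "old",   "income": "high", "buys": "yes"},
]
target = {i for i, r in enumerate(rows) if r["buys"] == "yes"}  # {1, 2}
low, up = approximations(rows, ["age", "income"], target)
# Rows 0 and 1 are indiscernible on (age, income), so only row 2 is certain:
# low == {2}, up == {0, 1, 2}
```

The gap between the two approximations is exactly the uncertain region that attribute reduction and rule extraction then work on.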
A number of scholars have already proposed data cleaning algorithms: a matrix-based cleaning algorithm for complete information systems, in which equivalence matrices partition and process the contents of a table, and rough-rule concepts that express the relevant rules of a data cleaning program. Mauricio A. Hernández and others argued that a computation performed jointly by several low-end server nodes yields more accurate results than a single high-end server node, and they also put forward an equivalence theory for data cleansing. However, these works did not address data cleansing against the background of massive data, nor did they specify a practical platform for handling it. Therefore, this thesis focuses on data with missing information, adapts an improved rough set theory together with the MapReduce parallel programming model, and applies the resulting algorithm on Hadoop, an open-source platform for massive data processing. Finally, the algorithm is applied to a system with missing data; experimental results show that it is effective and feasible.
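The abstract names MapReduce on Hadoop but does not reproduce the algorithm itself. As a rough illustration only, the following single-machine Python sketch simulates the map/shuffle/reduce phases for one plausible cleaning task: imputing a missing attribute from the most common value among records that agree on a grouping key. The record fields and the imputation rule are assumptions for the example, not the thesis's actual method:

```python
from collections import Counter, defaultdict

def map_phase(records, key_attrs):
    """Emit (key, record) pairs; records agreeing on key_attrs shuffle together."""
    for rec in records:
        yield tuple(rec[a] for a in key_attrs), rec

def reduce_phase(group, fill_attr):
    """Fill missing values with the most common observed value in the group."""
    observed = Counter(r[fill_attr] for r in group if r[fill_attr] is not None)
    default = observed.most_common(1)[0][0] if observed else None
    return [dict(r, **{fill_attr: r[fill_attr] if r[fill_attr] is not None else default})
            for r in group]

def run_job(records, key_attrs, fill_attr):
    """Simulate the shuffle between the map and reduce phases on one machine."""
    shuffled = defaultdict(list)
    for key, rec in map_phase(records, key_attrs):
        shuffled[key].append(rec)
    out = []
    for group in shuffled.values():
        out.extend(reduce_phase(group, fill_attr))
    return out

# Hypothetical dirty records: one zip code is missing
data = [
    {"city": "Xi'an", "zip": "710000"},
    {"city": "Xi'an", "zip": None},      # missing -> imputed from its group
    {"city": "Wuhan", "zip": "430000"},
]
cleaned = run_job(data, ["city"], "zip")
# The None zip is filled with "710000", the value shared by its city group
```

On a real Hadoop cluster the same map and reduce functions would run as distributed tasks, with the framework performing the shuffle; the sketch only mirrors that data flow.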
Keywords/Search Tags: Massive Data, Data Cleaning, MapReduce, Rough Sets, Attribute Reduction