
Research on a Data Cleaning Algorithm Based on MapReduce

Posted on: 2017-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: F Chen
Full Text: PDF
GTID: 2308330488450128
Subject: Systems analysis and integration
Abstract/Summary:
With the rapid development and wide application of the Internet and other information technologies, the amount of data generated in all areas keeps increasing. Extracting value from vast amounts of data through data mining has become a hot topic, and keeping data quality at a high level is a crucial step in data mining. In the progression from data to information, from information to knowledge, and from knowledge to intelligence, data cleaning technology provides the clean, accurate, and concise data on which this value depends.

Data cleaning occupies a very important position in the data mining process and is the key technology for obtaining high-quality data. Traditional data cleaning technology is aimed at structured data and small-scale processing and analysis; it handles data from heterogeneous sources poorly, producing low-quality analytical results, so it is not suitable for massive amounts of data, much of which is now unstructured. Researchers therefore need to summarize existing cleaning models, innovate data processing techniques, and combine them with new theoretical models so that data cleaning technology keeps pace with the characteristics of the big data era.

Correct decisions require reliable, accurate data that truly reflects the actual situation. High-quality data is a prerequisite for data analysis, and data cleaning can detect errors and improve quality when data is stored or migrated. Rough set theory, proposed by the Polish scholar Z. Pawlak, is a practical theory for analyzing vague, imprecise, and uncertain knowledge; one of its major purposes is to find practical decision rules in massive data and thereby acquire new knowledge.
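The abstract does not give code, but Pawlak's rough set ideas it invokes can be sketched briefly. The following minimal Python example (the attribute names and toy table are illustrative assumptions, not from the thesis) computes the lower and upper approximations of a target concept from the indiscernibility classes of a decision table:

```python
from collections import defaultdict

def indiscernibility_classes(table, attrs):
    """Group row indices that are indistinguishable on the chosen attributes."""
    classes = defaultdict(set)
    for i, row in enumerate(table):
        key = tuple(row[a] for a in attrs)
        classes[key].add(i)
    return list(classes.values())

def approximations(table, attrs, target):
    """Pawlak lower/upper approximations of a target set of row indices."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(table, attrs):
        if block <= target:   # block wholly inside the concept -> certainly in it
            lower |= block
        if block & target:    # block overlaps the concept -> possibly in it
            upper |= block
    return lower, upper

# Toy decision table: each row is {attribute: value} (hypothetical data)
rows = [
    {"age": "young", "income": "low",  "buys": "no"},
    {"age": "young", "income": "low",  "buys": "yes"},
    {"age": "old",   "income": "high", "buys": "yes"},
]
target = {i for i, r in enumerate(rows) if r["buys"] == "yes"}  # {1, 2}
low, up = approximations(rows, ["age", "income"], target)
# Rows 0 and 1 are indiscernible on (age, income), so only row 2 is certain:
# low == {2}, up == {0, 1, 2}
```

The gap between the two approximations is exactly the uncertain region that attribute reduction and rule extraction then work on.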
A number of scholars have already proposed data cleaning algorithms: a matrix-based cleaning algorithm for complete information systems, in which equivalence matrices partition and process the contents of a table, and rough-rule concepts that express the relevant rules of a data cleaning program. Mauricio A. Hernández and others argued that a computation performed jointly by several low-end server nodes yields more accurate results than a single high-end server node, and they also put forward an equivalence theory for data cleansing. However, these works did not address data cleansing against the background of massive data, nor did they specify a practical platform for handling it. Therefore, this thesis focuses on data with missing information, adapts an improved rough set theory together with the MapReduce parallel programming model, and applies the resulting algorithm on Hadoop, an open-source platform for massive data processing. Finally, the algorithm is applied to a system with missing data; experimental results show that it is effective and feasible.
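The abstract names MapReduce on Hadoop but does not reproduce the algorithm itself. As a rough illustration only, the following single-machine Python sketch simulates the map/shuffle/reduce phases for one plausible cleaning task: imputing a missing attribute from the most common value among records that agree on a grouping key. The record fields and the imputation rule are assumptions for the example, not the thesis's actual method:

```python
from collections import Counter, defaultdict

def map_phase(records, key_attrs):
    """Emit (key, record) pairs; records agreeing on key_attrs shuffle together."""
    for rec in records:
        yield tuple(rec[a] for a in key_attrs), rec

def reduce_phase(group, fill_attr):
    """Fill missing values with the most common observed value in the group."""
    observed = Counter(r[fill_attr] for r in group if r[fill_attr] is not None)
    default = observed.most_common(1)[0][0] if observed else None
    return [dict(r, **{fill_attr: r[fill_attr] if r[fill_attr] is not None else default})
            for r in group]

def run_job(records, key_attrs, fill_attr):
    """Simulate the shuffle between the map and reduce phases on one machine."""
    shuffled = defaultdict(list)
    for key, rec in map_phase(records, key_attrs):
        shuffled[key].append(rec)
    out = []
    for group in shuffled.values():
        out.extend(reduce_phase(group, fill_attr))
    return out

# Hypothetical dirty records: one zip code is missing
data = [
    {"city": "Xi'an", "zip": "710000"},
    {"city": "Xi'an", "zip": None},      # missing -> imputed from its group
    {"city": "Wuhan", "zip": "430000"},
]
cleaned = run_job(data, ["city"], "zip")
# The None zip is filled with "710000", the value shared by its city group
```

On a real Hadoop cluster the same map and reduce functions would run as distributed tasks, with the framework performing the shuffle; the sketch only mirrors that data flow.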
Keywords/Search Tags: Massive Data, Data Cleaning, MapReduce, Rough Sets, Attribute Reduction