With the rapid development of the information society, the volume of information and data has increased dramatically. Extracting the information that users care about from these vast amounts of data, quickly and accurately, has become increasingly important. Data cleaning, as an important part of data mining, faces considerable pressure when applied to massive data sets. Data cleaning consists of three main tasks: incomplete data cleaning, erroneous data cleaning, and duplicate data cleaning. For incomplete and duplicate data, many mature methods already achieve the desired objectives. Erroneous data, however, is defined differently in different contexts, so the cleaning methods vary widely and generalize poorly; for massive data in particular, relatively few solutions exist.

This paper takes the energy data of an energy monitoring platform as its research and experimental subject. The network infrastructure of the energy monitoring platform consists of a large number of sensors, such as temperature and humidity sensors, a variety of gauges, and carbon dioxide sensors. The platform's data is obtained from these sensors through a variety of network protocols, and its main features are its many categories and its large volume. During data acquisition, however, equipment and network problems inevitably produce some incorrect data; for example, at a particular moment a collected value may suddenly spike or plummet. Such data in the database of the energy monitoring platform adversely affects energy-saving strategies.
Therefore, to reduce this negative impact, this paper proposes a cleaning algorithm for erroneous data (also known as abnormal data or outlier data). To handle the massive volume of data, it presents a Hadoop-based distributed data cleaning method that uses Hadoop to mine and clean isolated points, ultimately safeguarding the correctness of data mining and decision-making.

This paper first briefly introduces the research background of the topic, the Hadoop platform, data cleaning, and outlier mining algorithms, and describes typical data cleaning methods for comparison. It then presents a Hadoop-based distributed data cleaning method and describes its design in detail. On this basis, it proposes a distributed isolated-point mining algorithm based on Hadoop, implemented with Map/Reduce, to clean isolated points in a distributed manner. Finally, it compares this algorithm with other isolated-point cleaning algorithms. The experimental results show that the proposed distributed data cleaning method improves the accuracy, flexibility, and speed of data cleaning.
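
As a rough illustration of the Map/Reduce structure described above, the following Python sketch simulates the map, shuffle, and reduce steps in memory rather than on a Hadoop cluster. It is not the algorithm proposed in this paper: the outlier rule (a modified z-score based on the median absolute deviation), the threshold of 3.5, the record layout `(sensor_id, value)`, and all function names are assumptions chosen only to make the example self-contained.

```python
from collections import defaultdict
from statistics import median


def map_phase(records):
    """Map step: emit one (sensor_id, value) pair per reading."""
    for sensor_id, value in records:
        yield sensor_id, value


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(sensor_id, values, threshold=3.5):
    """Reduce step: split one sensor's values into clean and outlier lists.

    Assumed rule (not from the paper): a value is an outlier when its
    modified z-score, 0.6745 * |v - median| / MAD, exceeds the threshold.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread to measure against; treat every value as clean.
        return sensor_id, list(values), []
    clean, outliers = [], []
    for v in values:
        score = 0.6745 * abs(v - med) / mad
        (outliers if score > threshold else clean).append(v)
    return sensor_id, clean, outliers


def clean_readings(records):
    """Run the simulated pipeline; returns {sensor_id: (clean, outliers)}."""
    groups = shuffle(map_phase(records))
    return {key: reduce_phase(key, values)[1:] for key, values in groups.items()}
```

Keying the map output by sensor means each reducer sees only one sensor's readings, so a sudden spike in a temperature series is judged against other temperature values rather than against carbon dioxide readings.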