With the coming of the information age, people are confronted with ever-increasing amounts of data and information in many fields, and these data are growing at a surprising speed. To improve work efficiency and quality of life, people must extract the valuable information hidden in these data, which has motivated research on mining knowledge from databases. However, as is well known, databases suffer from many problems, such as redundant, missing, uncertain, and inconsistent data, all of which are barriers to knowledge discovery. It is therefore important to preprocess data before discovering knowledge from databases.

This paper focuses on data preprocessing in data mining, especially data cleaning, and implements the data preprocessing functions in the Data Mining Laboratory Platform (DMLab).

First, data preprocessing is described both in general and in detail, covering the research background, the concepts, and the research status of the main preprocessing techniques. Then the existing data preprocessing techniques are analyzed in depth, including data cleaning, data sampling, data transformation, and data reduction. The paper places strong emphasis on missing data imputation: many imputation algorithms are studied and discussed in detail, and an imputation algorithm based on clustering is proposed. Finally, the data preprocessing module of the Data Mining Laboratory Platform is implemented based on the techniques discussed earlier; the module provides data cleaning, data sampling, data transformation, and data reduction functions.

The paper introduces the basic knowledge and algorithms of data preprocessing, especially the cleaning of missing data, and objectively discusses the merits and drawbacks of missing data cleaning techniques.
Many data preprocessing techniques that are widely applied at present are studied, and on that basis the functions of the data preprocessing module in the DMLab system are designed and implemented. The work not only implements the basic preprocessing algorithms required by the system but also proposes a new method that applies a clustering algorithm to imputation; test results and conclusions are provided at the end. The main contribution is the proposed clustering-based imputation algorithm for missing data.