
The Class-Mean Method And Its Extensions To Handling Incomplete Data In Data Mining

Posted on: 2011-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Ji
Full Text: PDF
GTID: 2178360308460627
Subject: Applied Mathematics
Abstract/Summary:
With advances in data collection and storage technology and the growing demand for access to information, data mining has become an indispensable tool in many industries. Data preprocessing is undoubtedly a key step in the data mining process, because satisfactory results can only be obtained from reliable and accurate data. In general, data preprocessing accounts for about 60% of the workload of the entire mining process, so research on data preprocessing has important theoretical and practical value. Handling incomplete data is both the most common and the most basic problem in data preprocessing, since very few real databases are complete, that is, free of missing values. How to handle missing data has therefore attracted the attention of many researchers and become an active research problem.

Methods for dealing with incomplete data fall into three categories: a) removing tuples; b) filling in missing data; and c) no handling. Han and Zhang consider imputation to be the most commonly used method of dealing with missing values, judging by how often it is used and how extensively it has been studied. The method therefore deserves careful attention both technically and theoretically. This thesis focuses on how to impute missing data simply and how to handle incomplete data in data mining effectively. First, we systematically summarize and compare several commonly used methods for imputing missing data. Second, to address the disadvantages of the class-mean method, we propose two adjustment techniques: the weighted class-mean method and the fuzzy class-mean method. Our main results are as follows:

1. By introducing the relevant background and describing the mining process, we explain that data preprocessing is one of the most important steps in the mining process. The treatment of incomplete data in the preprocessing stage is discussed in detail, including its causes, prevention, and postprocessing. Several commonly used postprocessing imputation methods are analyzed, and the principles, advantages, and disadvantages of each are investigated.

2. The commonly used class-mean method, a simple filling strategy, has two main shortcomings: incomplete information may place a record in the wrong group, so that the imputed value is far from the true value, and imputation reduces the variation among the variables. We propose an improved method that assigns different weights to each group in order to adjust the imputed values and bring them as close as possible to the true values. Imputed values within the same group can also differ according to their weights, which ultimately increases the variation among the variables. To address the subjective weights in the adjusted class-mean method, we put forward fuzzy class-mean imputation, which can overcome both of the above faults.

3. Simulation experiments on the three methods are performed in R, and the feasibility and effectiveness of the two improved methods are validated by comparison with previous methods in the literature.
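
To make the class-mean idea in the abstract concrete, the following is a minimal Python sketch, not the thesis's own R code. `class_mean_impute` fills each missing value with the mean of the observed values in the same class, which shows why every imputed value within a class is identical. `membership_weighted_impute` is a hypothetical illustration of letting imputed values differ within a class by blending class means with per-record membership degrees. The function names, the use of `None` for missing values, and the blending formula are all assumptions made for illustration; the abstract does not give the formulas of the weighted or fuzzy class-mean methods.

```python
from statistics import mean


def class_means(values, labels):
    """Mean of the observed (non-None) values for each class label."""
    observed = {}
    for v, c in zip(values, labels):
        if v is not None:
            observed.setdefault(c, []).append(v)
    return {c: mean(vs) for c, vs in observed.items()}


def class_mean_impute(values, labels):
    """Plain class-mean imputation: a missing value takes its class mean.

    A class with no observed value has no mean, so its entries stay None.
    """
    means = class_means(values, labels)
    return [means.get(c) if v is None else v for v, c in zip(values, labels)]


def membership_weighted_impute(values, memberships, means):
    """Hypothetical variant (an assumption, not the thesis's formula).

    A missing value is replaced by the membership-weighted average of the
    class means, so two records in the same class can get different values.
    memberships[i] maps class label -> degree in [0, 1]; it is only read
    for records whose value is missing.
    """
    result = []
    for v, m in zip(values, memberships):
        if v is None:
            total = sum(m.values())
            v = sum(w * means[c] for c, w in m.items()) / total
        result.append(v)
    return result


if __name__ == "__main__":
    values = [1.0, 3.0, None, 10.0, None, 14.0]
    labels = ["a", "a", "a", "b", "b", "b"]
    means = class_means(values, labels)
    print(class_mean_impute(values, labels))

    # Illustrative membership degrees for the two incomplete records.
    memberships = [None, None, {"a": 0.75, "b": 0.25},
                   None, {"a": 0.5, "b": 0.5}, None]
    print(membership_weighted_impute(values, memberships, means))
```

In the plain version the imputed values coincide with the class means (2.0 and 12.0 here), which adds no spread within a class. The blended version moves each imputed value toward the classes the record partially belongs to. How the thesis actually derives its weights or memberships is not stated in the abstract.
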
Keywords/Search Tags:Data Mining, Incomplete Data, Class-mean Imputation Method, Weighted Adjustment, Fuzzy Techniques, Membership