
Analysis and Improvement of the KNN Imputation Algorithm

Posted on: 2011-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: L C Huang
GTID: 2178360305978002
Subject: Computer software and theory
Abstract/Summary:
Data mining is a new and active research area. After rapid development over the past ten years, many mature algorithms have been developed for handling massive data effectively, and these algorithms and techniques perform well in the field of data mining. However, most of the problems that data mining is applied to come from real life, and data generated and collected from real life is usually full of noise, inconsistencies, missing values, and similar defects. Data preprocessing therefore plays an increasingly important role in the data mining process.
Among the problems found in real-world data, missing data is the most common. Most classical algorithms developed so far have great difficulty dealing with missing data, because the causes of missing data are complex and the data generation process differs from one application to another. Algorithms are therefore usually designed and developed on the assumption that the data was collected under ideal conditions. Mining with missing data, however, seriously affects both the mining process and its outcome, and can even lead to wrong models and conclusions. As a result, there is a large gap between data mining algorithms and the data actually available.
Many scholars in China and abroad have studied how to prevent, avoid, and deal with missing data. This research draws on results from statistics, machine learning, probability, and other fields. Many algorithms developed in the field of data imputation have proved very successful in experiments and industrial applications.
In general, even when an imputation algorithm performs only moderately well, its contribution to improving mining algorithms and mining results is clear.
This thesis analyzes and improves the KNN algorithm, one of the most widely used algorithms, which is scalable and adaptable. KNN is a generalization of the NN (nearest neighbor) algorithm, which was first proposed by Cover and Hart in 1967, originally for classification. The basic idea is to classify unknown cases using nearby cases whose class labels are already known. Because it is easy to understand and implement and is broadly applicable, the algorithm has been widely used since its introduction in classification, clustering, information retrieval and querying, and missing data imputation. The KNN imputation algorithm is an improved version of the NN imputation algorithm for missing data imputation.
The KNN imputation algorithm uses the data points near a missing point to estimate the missing value and fill it in. The traditional KNN imputation algorithm has many deficiencies; for example, its computational cost is considerable. Many improved KNN imputation algorithms exist. They usually focus on the distance measure, the distance calculation, the calculation of the imputed value, and the storage and indexing of results.
In the literature, an imputation method typically applies one particular method, or a sequence of several algorithms, to impute the entire data set. Under the missing data classification proposed in this thesis, by contrast, different missing values are imputed with different methods even within the same data set. Algorithms, especially those based on density and neighbors, should classify the missing data and impute each class with an appropriate method, rather than treating all of the data the same way with a single algorithm.
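The traditional KNN imputation idea described above (estimating a missing value from the nearest complete cases) can be sketched roughly as follows. This is only an illustrative baseline, not the K-1NN, CNN, PKNN, or PCNN algorithms of the thesis. The function name `knn_impute`, the use of Euclidean distance over the observed attributes, and the unweighted mean of the neighbors are all assumptions made for the example.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest complete rows.

    Hypothetical helper: distance is Euclidean over the attributes that
    are observed in the incomplete row (an assumption, not the thesis's
    definition).
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row to impute from")

    result = []
    for row in rows:
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            result.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(other):
            # Compare only on attributes the incomplete row actually has.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        # Scanning every complete row for every incomplete row is the
        # "considerable computation" the abstract mentions.
        neighbors = sorted(complete, key=dist)[:k]

        filled = list(row)
        for j in missing:
            filled[j] = sum(n[j] for n in neighbors) / len(neighbors)
        result.append(filled)
    return result


data = [[1.0, 2.0], [2.0, 4.0], [10.0, 20.0], [1.5, None]]
filled = knn_impute(data, k=2)
print(filled[3])  # the missing second attribute is estimated from the 2 nearest rows
```

Here the incomplete row is closest to the first two complete rows, so its missing attribute is filled with the mean of their values; the distant third row has no influence.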
The main work of this thesis is, within the framework of classifying imputation, to propose the K-1NN algorithm and the CNN algorithm, the latter based on the geometric center. Combined with partial imputation, the thesis identifies two types of missing points that are not suitable for imputation and separates these two classes into another part. On this basis, it proposes two imputation algorithms: PKNN and PCNN. Experiments confirm that the classifying imputation method and the partial imputation strategy effectively improve the accuracy of the KNN imputation algorithm.
Keywords/Search Tags: Missing data imputation, partial imputation, classifying imputation, KNN imputation algorithm