New Technologies for Imputation and Classification Based on the Nearest Neighbor Approach

Posted on: 2011-09-25    Degree: Master    Type: Thesis
Country: China    Candidate: M L Zhu    Full Text: PDF
GTID: 2178360305477848    Subject: Computer software and theory
Abstract/Summary:
In today's Internet age, massive information processing has become a major need in China's economic development. The nearest neighbor method is one of the most important theories and techniques for processing massive information: by using the known nearest points to estimate and approximate user queries, it provides a simple, intuitive, and effective approach to processing and sharing massive information. This thesis studies new technologies and algorithms for imputation and classification based on the nearest neighbor method.

First, this thesis studies the k nearest neighbor (kNN) algorithm from the perspectives of missing data imputation and data classification. We describe the basic principle of the kNN algorithm in detail, analyze its advantages and disadvantages, and summarize commonly used improvements. On this basis, the thesis presents new strategies to remedy some of the shortcomings of the kNN algorithm and verifies, both theoretically and experimentally, that they achieve higher imputation (classification) accuracy.

On the one hand, this thesis studies new theories and algorithms for nearest neighbor imputation. Because the k-Nearest Neighbor Imputation (kNNI) algorithm is often biased in choosing the k nearest neighbors of a missing datum, a new imputation method for missing values is put forward: the Quadrant-Encapsulated-Nearest-Neighbor based Imputation method (QENNI). The algorithm imputes a missing datum using the quadrant nearest neighbors (the points of the encapsulant) around it, and is therefore not biased in selecting nearest neighbors. In addition, this thesis analyzes three possible weighting methods for the Shell Neighbor Imputation (SNI) algorithm [1,2] and concludes that allowing duplicate neighbor selection helps to improve the imputation quality of SNI; among the three, the Frequency-Distance Weighted Shell Neighbor Imputation method (fdwSNI) achieves the best imputation results.
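To make the kNNI baseline concrete, here is a minimal sketch of nearest neighbor imputation in Python: the missing value is filled with the unweighted mean of the k nearest complete rows. The function name, the Euclidean distance, and the unweighted mean are illustrative assumptions on my part; the thesis's QENNI and fdwSNI methods modify exactly this neighbor-selection and weighting step.

```python
import math

def knn_impute(data, target_idx, missing_col, k=3):
    """Impute data[target_idx][missing_col] from the k nearest donor rows.

    Distance is Euclidean over the columns that are observed in the
    target row; the imputed value is the unweighted mean of the donors'
    values in the missing column.
    """
    target = data[target_idx]
    observed = [j for j, v in enumerate(target) if v is not None]

    def dist(row):
        return math.sqrt(sum((row[j] - target[j]) ** 2 for j in observed))

    # Candidate donors: rows complete in the observed columns and in the
    # column to be imputed.
    donors = [row for i, row in enumerate(data)
              if i != target_idx
              and row[missing_col] is not None
              and all(row[j] is not None for j in observed)]
    donors.sort(key=dist)
    neighbors = donors[:k]
    return sum(row[missing_col] for row in neighbors) / len(neighbors)

rows = [
    [1.0, 2.0],
    [1.1, 2.1],
    [5.0, 9.0],
    [1.05, None],  # missing value to impute
]
print(knn_impute(rows, 3, 1, k=2))  # mean of the two closest rows, ~2.05
```

Note the bias the thesis targets: the k nearest neighbors here may all lie on the same side of the missing datum, whereas QENNI selects neighbors quadrant by quadrant so that the donors surround it.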
For missing data imputation, the imputation algorithms themselves are important, but a good evaluation method undoubtedly provides effective guidance for algorithm selection. Using a concrete example, this thesis points out that the commonly used indicator RMSE is easily dominated by a few serious imputation errors. We therefore propose a new evaluation method, called goodness, to overcome this defect: even when a few serious imputation deviations occur, goodness still reaches a sensible conclusion.

On the other hand, this thesis establishes a Shell Neighbor Classification method (SNC). It is not biased in selecting nearest neighbors, is not sensitive to the choice of distance metric, and achieves better classification accuracy on large data sets.

In practical data mining applications, data quality is usually poor and data are often incomplete, so developing noise-robust mining algorithms is a practical and challenging task. Eliminating noise is often difficult and expensive, while discarding historical data (even noisy data) for the sake of completeness greatly reduces the amount of data available for analysis. The k-nearest neighbor (kNN) classification is distance-based and thus only locally optimal: it does not take into account the partial or global data distribution, which can affect classification accuracy, so kNN classification is sensitive to noisy data. This thesis designs a classification algorithm, called NCT, that incorporates the class distributions in the k Nearest neighbors, the Cluster, and the Training set. This combination enhances the noise tolerance of classification, i.e., yields noise-robust classification. Experimental results show that the proposed NCT algorithm not only has better classification performance, but also good robustness in noisy environments.
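The weakness of RMSE described above can be reproduced in a few lines: a single gross imputation error dominates the score even when every other value is imputed perfectly. The `median_abs_error` function below is only an illustrative robust alternative of my own choosing; the abstract does not give the exact definition of the goodness measure.

```python
import math
from statistics import median

def rmse(true_vals, imputed):
    """Root mean squared error between true and imputed values."""
    n = len(true_vals)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_vals, imputed)) / n)

def median_abs_error(true_vals, imputed):
    """A robust stand-in metric: median of the absolute deviations."""
    return median(abs(t - p) for t, p in zip(true_vals, imputed))

true_vals = [10.0] * 10
good = [10.1] * 10            # uniformly small errors on every value
mixed = [10.0] * 9 + [100.0]  # nine perfect imputations, one gross error

print(rmse(true_vals, good))   # ~0.1
print(rmse(true_vals, mixed))  # ~28.5: one outlier dominates the score
```

Under RMSE the `mixed` result looks almost 300 times worse than `good`, although nine of its ten imputations are exact; a median-based metric instead rates `mixed` as having zero typical error. A goodness-style evaluation is meant to remain informative in exactly this situation.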
In noise-free environments, the NCT algorithm is slightly better than kNN; in noisy environments, the NCT method is significantly better than traditional kNN in classification accuracy, and this noise tolerance becomes more pronounced as the noise ratio increases. Finally, the clustering and global information introduced in the NCT algorithm is varied into other combinational forms. The experimental results show that, in noisy environments, all the combinations improve the classification result to some degree, but the NCT variant based on linear interpolation improves classification accuracy the most.

In short, the main innovations of this thesis can be summarized as follows:
1. Because the kNNI algorithm is often biased in choosing the k nearest neighbors of a missing datum, a new imputation method for missing values is put forward: the Quadrant-Encapsulated-Nearest-Neighbor based Imputation method (QENNI).
2. A new evaluation method called goodness is proposed. When there are a few serious imputation deviations, the goodness evaluation is more reliable than RMSE.
3. An SNC classification model is established that is not biased in selecting nearest neighbors. It is not sensitive to distance metrics and achieves better classification accuracy on large data sets.
4. A classification algorithm called NCT is designed by incorporating the class distributions in the k nearest neighbors, the cluster, and the training set. It enhances the noise tolerance of classification, i.e., yields noise-robust classification.

To verify the effectiveness and validity of the proposed algorithms and strategies, we conduct many experiments on real datasets. Experimental results show that the proposed algorithms QENNI, SNC, and NCT outperform the k nearest neighbor algorithm; in particular, the NCT method is significantly better in noisy environments.
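A rough sketch of the NCT idea, assuming a simple linear-interpolation blend of the three class distributions (neighbor-level, cluster-level, and training-set-level): the weights `alpha` and `beta` and the function names are hypothetical, since the abstract does not specify the exact combination rule.

```python
from collections import Counter

def class_dist(labels):
    """Empirical class distribution of a list of labels."""
    n = len(labels)
    return {cls: cnt / n for cls, cnt in Counter(labels).items()}

def nct_predict(neighbor_labels, cluster_labels, train_labels,
                alpha=0.6, beta=0.3):
    """Blend the three class distributions by linear interpolation and
    return the class with the highest blended score."""
    dn = class_dist(neighbor_labels)   # local: k nearest neighbors
    dc = class_dist(cluster_labels)    # mid-level: the sample's cluster
    dt = class_dist(train_labels)      # global: whole training set
    classes = set(dn) | set(dc) | set(dt)
    score = {c: alpha * dn.get(c, 0.0)
                + beta * dc.get(c, 0.0)
                + (1 - alpha - beta) * dt.get(c, 0.0)
             for c in classes}
    return max(score, key=score.get)

# A noisy neighborhood: a plain kNN majority vote would return 'B',
# but the cluster-level distribution strongly favors 'A'.
neighbors = ['A', 'B', 'B']
cluster = ['A'] * 9 + ['B']
train = ['A'] * 50 + ['B'] * 50
print(nct_predict(neighbors, cluster, train))  # 'A'
```

This illustrates the mechanism behind the noise tolerance: mislabeled points among the k nearest neighbors can flip a pure kNN vote, while the cluster and training-set distributions are far less affected by a few noisy labels and pull the decision back.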
Keywords/Search Tags:k Nearest Neighbor Algorithm, Shell Neighbor, Missing Data Imputation, Classification