The Algorithm Of Filling Missing Data Based On Cluster Analysis

Posted on: 2014-03-30; Degree: Master; Type: Thesis
Country: China; Candidate: C Zhang; Full Text: PDF
GTID: 2298330467468782; Subject: Mechanical and electrical engineering
Abstract/Summary:
With the rapid development of information technology, the analysis of electronic data has become more convenient, easier, and safer. In practical applications, traditional manual recording is gradually being replaced by computer input. However, incomplete data is widespread across application fields, and missing data can seriously affect the properties and results of computations. How to quickly and effectively find the closest substitute for the true missing value has become an urgent problem.
To improve the accuracy of filling missing data, this thesis studies related data mining algorithms in depth, especially cluster analysis and the MGNN (Mahalanobis-Gray and Nearest Neighbor) algorithm. It improves the distance formula of the MGNN algorithm and proposes the ADGMKNN (Advanced Gray-Mahalanobis and k-Nearest Neighbor) algorithm by bringing classification features into the cluster analysis. The algorithm computes the distance from the grey correlation degree and the Mahalanobis distance, sorts the weighted distances after cluster analysis, and selects the K smallest. A missing value is filled with the average value of the selected elements if the data is continuous, and with the most frequently occurring value if the data is discrete.
Research shows that when the relationships between cases are unknown, grey relational analysis is better at measuring the closeness among multiple instances; when the density relationships between cases are clear, Euclidean distance or Mahalanobis distance measures the degree of correlation more effectively [4]. Grey relational analysis and Euclidean distance are therefore complementary in these two situations, and combining the two methods measures the degree of relationship more accurately.
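The filling procedure described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the improved distance formula, so the code assumes Deng's grey relational grade with a resolution coefficient `rho = 0.5`, a Mahalanobis distance scaled by its maximum, and a simple weighted sum controlled by a hypothetical `alpha`. The clustering step is omitted, and every function name and the toy data are illustrative only.

```python
from collections import Counter

import numpy as np


def grey_distance(target, candidates, rho=0.5):
    """Return 1 - grey relational grade of each candidate row w.r.t. target."""
    delta = np.abs(candidates - target)            # shape (n_rows, n_attrs)
    d_min, d_max = delta.min(), delta.max()
    if d_max == 0:                                 # every candidate equals target
        return np.zeros(len(candidates))
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)
    return 1.0 - coeff.mean(axis=1)


def mahalanobis_distance(target, candidates):
    """Mahalanobis distance from target to each candidate row."""
    cov = np.atleast_2d(np.cov(candidates, rowvar=False))
    inv_cov = np.linalg.pinv(cov)                  # pseudo-inverse for stability
    diff = candidates - target
    squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(squared, 0.0))


def fill_missing(target, complete, missing_idx, k=3, alpha=0.5, discrete=False):
    """Fill target[missing_idx] from the k nearest complete rows.

    alpha weights the grey distance against the scaled Mahalanobis distance;
    this weighting is an assumption, not the formula from the thesis.
    """
    observed = [j for j in range(complete.shape[1]) if j != missing_idx]
    cand = complete[:, observed].astype(float)
    tgt = np.asarray(target, dtype=float)[observed]

    grey = grey_distance(tgt, cand)
    maha = mahalanobis_distance(tgt, cand)
    if maha.max() > 0:
        maha = maha / maha.max()                   # scale to [0, 1]

    dist = alpha * grey + (1.0 - alpha) * maha
    nearest = np.argsort(dist)[:k]                 # k smallest distances
    values = complete[nearest, missing_idx]

    if discrete:
        # Discrete attribute: use the most frequent value among the neighbors.
        return Counter(values.tolist()).most_common(1)[0][0]
    # Continuous attribute: use the mean of the neighbors.
    return float(values.mean())


# Toy data (made up for illustration): the third attribute of `target` is missing.
complete = np.array([
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 11.0],
    [0.9, 1.9, 9.0],
    [5.0, 6.0, 50.0],
    [5.2, 6.1, 52.0],
])
target = [1.0, 2.0, np.nan]
filled = fill_missing(target, complete, missing_idx=2, k=3)
```

In this toy example the three rows closest to `target` under both measures are the first three, so the filled value is their mean. Passing `discrete=True` switches the fill rule to the most frequent neighbor value.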
Therefore, this thesis fills missing data on the basis of the MGNN algorithm and improves its formula for the correlation distance. The experimental data were selected at random with the WIND software from records of several stocks between January 2011 and October 2012. The improved case-distance formula was combined with the class average method of system (hierarchical) clustering to classify the attributes of the cases. The ADGMKNN, KNN, and MGNN algorithms were each used to fill missing data on the same data set in MATLAB. The experimental results show that, when a single value is missing, the average error rate of the ADGMKNN algorithm is 3.17%, lower than that of the KNN and MGNN algorithms; when part of the data is missing (missing data is 5% of the total), the root mean square errors of the ADGMKNN, KNN, and MGNN algorithms are 2.223, 2.699, and 2.848 respectively, so the ADGMKNN algorithm fills missing data more accurately than the MGNN and KNN algorithms.
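The two error measures reported above can be sketched as follows. The abstract does not define them, so this assumes that the average error rate is the mean relative error and that the "mean error square root" is the usual root mean square error. The function names and sample numbers are hypothetical and are not data from the thesis.

```python
import numpy as np


def average_error_rate(true_values, filled_values):
    """Mean relative error |filled - true| / |true| (assumed definition)."""
    t = np.asarray(true_values, dtype=float)
    f = np.asarray(filled_values, dtype=float)
    return float(np.mean(np.abs(f - t) / np.abs(t)))


def rmse(true_values, filled_values):
    """Root mean square error between true and filled values."""
    t = np.asarray(true_values, dtype=float)
    f = np.asarray(filled_values, dtype=float)
    return float(np.sqrt(np.mean((f - t) ** 2)))


# Made-up values, only to show the calls.
true_vals = [10.0, 20.0]
filled_vals = [11.0, 18.0]
rate = average_error_rate(true_vals, filled_vals)
error = rmse(true_vals, filled_vals)
```

A lower value of either measure means the filled values are closer to the true ones, which is the sense in which the three algorithms are compared.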
Keywords/Search Tags: Grey correlation, Mahalanobis distance, cluster analysis, nearest neighbor algorithm