
Research on Handling Missing Data Based on Statistical Learning

Posted on: 2013-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: L Cao
Full Text: PDF
GTID: 2248330377959112
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the digitization of society, and the economy, the volume of data is growing at an astonishing rate. Extracting valuable information from large data sets has become increasingly important, and data mining technology emerged in response and has attracted growing attention. Most data mining algorithms and models assume an ideal, complete data set, but real data is often incomplete; that is, it contains missing values. The missing data is therefore usually handled first, so that mining can proceed on a complete data set.

There are many imputation methods for estimating missing data, each with its own advantages and disadvantages. Building on extensive study of missing data, in this paper we propose a method for handling missing data that consists of four major steps: variable selection, regression imputation, cluster analysis, and a second regression imputation. Because the method draws heavily on statistical learning, it is called the method of handling missing data based on statistical learning. In addition, for the cluster analysis used in the new method, we study the advantages and disadvantages of K-means in depth and propose an improved clustering algorithm. We then propose a complete cleaning process for handling missing values.

Finally, we conduct experiments on a data set with cluster structure, a random data set, and a real data set. Compared with other methods for handling missing data, the experiments show the effectiveness of the method of handling missing data based on statistical learning.
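
The four steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the details of any step, so everything below is an assumption made for illustration: variable selection is done by picking the single predictor with the highest absolute Pearson correlation, regression is simple least squares on that predictor, clustering is plain K-means (not the improved algorithm the thesis proposes), and only one column (`target`) contains missing values, marked as `None`. The function names (`impute_missing`, `fit_line`, `kmeans`) are hypothetical.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Least-squares (slope, intercept), or None if the fit is undefined."""
    if len(xs) < 2:
        return None
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return None
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain K-means with a deterministic farthest-point start (assumption)."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

def impute_missing(rows, target, k=2):
    """Return a copy of rows with None in column `target` filled in."""
    observed = [i for i, r in enumerate(rows) if r[target] is not None]
    missing = [i for i, r in enumerate(rows) if r[target] is None]
    ys = [rows[i][target] for i in observed]

    # Step 1: variable selection (assumed: best single predictor by |correlation|).
    candidates = [j for j in range(len(rows[0])) if j != target]
    best = max(candidates,
               key=lambda j: abs(pearson([rows[i][j] for i in observed], ys)))

    # Step 2: first regression imputation, fitted on all observed rows.
    global_fit = fit_line([rows[i][best] for i in observed], ys)
    if global_fit is None:
        raise ValueError("not enough observed data to fit a regression")
    filled = [list(r) for r in rows]
    for i in missing:
        filled[i][target] = global_fit[0] * rows[i][best] + global_fit[1]

    # Step 3: cluster analysis on the provisionally completed data.
    labels = kmeans([(r[best], r[target]) for r in filled], k)

    # Step 4: second regression imputation, refitted inside each cluster
    # using only originally observed rows; fall back to the global fit.
    for c in range(k):
        obs_c = [i for i in observed if labels[i] == c]
        fit = fit_line([rows[i][best] for i in obs_c],
                       [rows[i][target] for i in obs_c]) or global_fit
        for i in missing:
            if labels[i] == c:
                filled[i][target] = fit[0] * rows[i][best] + fit[1]
    return filled
```

The point of the second regression in this sketch is that a single global fit can be poor when the data contains groups with different relationships between predictor and target; refitting within each cluster lets each group use its own regression line. Whether the thesis motivates the step in the same way is not stated in the abstract.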
Keywords/Search Tags: data preprocessing, missing data, clustering, regression