High-dimensional Sparse Discriminant Analysis By Thresholding Covariance Matrix

Posted on: 2016-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Wan
GTID: 2308330470471868
Subject: Operational Research and Cybernetics
Abstract/Summary:
With the rapid development of the Internet and related technologies, fields such as social surveys and genomics now collect large volumes of data every day. Extracting useful information from these data has become a new challenge, and data mining has emerged in response. Classification is an important branch of data mining and plays an essential role in image classification and behavioral pattern recognition. How to use the information in known samples to discriminate effectively has therefore attracted considerable academic attention. However, the dimension p keeps growing and often exceeds the training sample size n, a setting known as "large p, small n". Because classical discriminant methods, such as the distance discriminant rule, the Bayes rule, and the Fisher discriminant rule, are no longer applicable in this setting, further theoretical analysis is needed.

Many sparse discriminant methods have already been proposed, such as the Independence Rule, the Feature Annealed Independence Rule (FAIR), and the Nearest Shrunken Centroids Classifier (NSCC). Starting from the classical Bayes rule, these classifiers modify the parameter estimation procedure or use t-test statistics to select features. However, they consider only the feature variances and ignore correlations among features when estimating the population centroids, which may produce inferior classification in some scenarios.

This thesis studies discriminant methods for high-dimensional sparse data, mainly by thresholding the covariance matrix. First, by combining the classical Bayes discriminant method with covariance matrix thresholding, a new classification method is proposed for high-dimensional sparse discriminant analysis. The method exploits the sparsity of the high-dimensional covariance matrix: hard thresholding truncates the sample covariance matrix, which brings the estimate closer to the true covariance matrix. The new method uses all features and retains the large correlations among them, while ignoring small or zero correlations. Next, we study the theoretical misclassification error and validate the method's performance in simulation studies. In the experiments, the hard threshold is chosen by 5-fold cross-validation as the truncation threshold that yields the minimum average misclassification proportion. On the one hand, discriminant analysis is performed on simulated data generated from three different forms of covariance matrix, and the mean and standard deviation of the results over 100 simulations are compared with those of other discriminant methods (the Bayes discriminant rule and the independence rule). On the other hand, the method is applied to a real dataset, a bank customer credit classification problem in which the aim is to determine whether a customer is creditworthy. Comparing the results of the three discriminant methods further indicates that the proposed method outperforms the others.
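The procedure described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes two classes labeled 0 and 1, equal priors, and a common covariance matrix. The function names, the choice to leave the diagonal unthresholded, the use of a pseudo-inverse, and the non-stratified fold split are all illustrative assumptions.

```python
import numpy as np

def hard_threshold(S, t):
    """Zero every off-diagonal entry with |s_ij| <= t (diagonal kept: assumption)."""
    T = np.where(np.abs(S) > t, S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

def fit(X, y, t):
    """Two-class linear Bayes rule using a hard-thresholded pooled covariance."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    R = np.vstack([X0 - mu0, X1 - mu1])       # within-class residuals
    S = R.T @ R / (len(X) - 2)                # pooled sample covariance
    # Assumption: pseudo-inverse, because a thresholded matrix may be singular.
    w = np.linalg.pinv(hard_threshold(S, t)) @ (mu1 - mu0)
    return w, (mu0 + mu1) / 2.0

def predict(model, X):
    """Assign class 1 when the linear discriminant score is positive."""
    w, mid = model
    return ((X - mid) @ w > 0).astype(int)

def cv_threshold(X, y, grid, k=5, seed=0):
    """Pick the threshold in `grid` with the lowest mean k-fold error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errs = []
    for t in grid:
        fold_err = []
        for f in folds:
            train = np.setdiff1d(np.arange(len(X)), f)
            model = fit(X[train], y[train], t)
            fold_err.append(np.mean(predict(model, X[f]) != y[f]))
        errs.append(np.mean(fold_err))
    return grid[int(np.argmin(errs))]

if __name__ == "__main__":
    # Synthetic "large p, small n" data (hypothetical sizes, identity covariance).
    rng = np.random.default_rng(0)
    n, p = 40, 100
    X = rng.normal(size=(2 * n, p))
    X[n:, :10] += 1.0                         # class 1 shifted in 10 features
    y = np.repeat([0, 1], n)
    t_best = cv_threshold(X, y, grid=[0.1, 0.2, 0.4, 0.8])
    print("chosen threshold:", t_best)
```

Setting the threshold to 0 keeps every entry of the sample covariance matrix, and a very large threshold keeps only the diagonal, which corresponds to the independence rule. The cross-validated threshold sits between these two extremes.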
Keywords/Search Tags: High-dimensional classification, Sparse discriminant, Thresholding covariance matrix, Cross-Validation