
Theoretical analysis of classification under CCC-Noise and its application to semi-supervised text mining

Posted on: 2009-06-10
Degree: Ph.D.
Type: Thesis
University: University of California, Riverside
Candidate: Bi, Yingtao
Full Text: PDF
GTID: 2448390005950554
Subject: Statistics
Abstract/Summary:
In many real-world classification problems, class-conditional classification noise (CCC-Noise) often degrades the performance of a classifier that is built naively, ignoring the noise. This dissertation investigates the impact of CCC-Noise on the estimates of the unknown parameters in a popular generative classifier, Normal Discriminant Analysis (NDA), and in its corresponding discriminative classifier, Logistic Regression (LR). The misclassification error rates of the two classifiers under CCC-Noise are also compared. Typically, the asymptotic error rate of LR is much larger than that of NDA under the normal assumption, although both increase as the noise level increases. It is also shown that in low CCC-Noise settings, NDA converges to its lower asymptotic error rate much faster than LR under the normal assumption.

Following the theoretical analysis of the two classifiers under CCC-Noise, the logistic regression paradigm is extended to incorporate CCC-Noise directly into the associated likelihood function, and the EM algorithm is used to compute valid maximum likelihood estimates. Simulation studies show significant improvements with this approach over naively ignoring CCC-Noise.

The modified likelihood approach for dealing with CCC-Noise can be applied to the rapidly growing area of semi-supervised learning. In particular, a new semi-supervised learning framework is proposed in which a classifier is built from the labeled training dataset and then applied to an unlabeled dataset to derive pseudo labels. These pseudo labels are treated as labels subject to CCC-Noise. The mislabeling probabilities of the classifier built from the training dataset are estimated via cross-validation. The pseudo-labeled data are then incorporated into the labeled data probabilistically.
The final part of this thesis applies the proposed learning framework to the classification of airline safety inspection text reports and compares it with a supervised naive Bayes method.
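
The idea of building CCC-Noise into the logistic regression likelihood and fitting it with EM can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the thesis: labels are binary, the two flip probabilities are already known (in the semi-supervised setting described above they would instead be estimated by cross-validation), and the M-step is carried out with a few Newton steps. The names `em_noisy_logistic`, `fit_logistic_soft`, `flip01`, and `flip10` are hypothetical, and the thesis's actual formulation may differ.

```python
import numpy as np


def sigmoid(t):
    # Clip the linear predictor so exp() cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))


def fit_logistic_soft(X, q, w, n_newton=5, ridge=1e-6):
    """M-step: a few Newton steps on the logistic log-likelihood with
    soft targets q in [0, 1], warm-started from the current weights w."""
    for _ in range(n_newton):
        p = sigmoid(X @ w)
        grad = X.T @ (q - p)
        # Negative Hessian, with a small ridge term for numerical stability.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        w = w + np.linalg.solve(hess, grad)
    return w


def em_noisy_logistic(X, z, flip01, flip10, n_iter=100):
    """EM for logistic regression under class-conditional label noise.

    X      : (n, d) design matrix, including an intercept column
    z      : (n,) observed, possibly flipped, labels in {0, 1}
    flip01 : assumed known P(observed 1 | true 0)
    flip10 : assumed known P(observed 0 | true 1)
    """
    w = np.zeros(X.shape[1])
    # Probability of the observed label given each possible true label.
    lik1 = np.where(z == 1, 1.0 - flip10, flip10)
    lik0 = np.where(z == 1, flip01, 1.0 - flip01)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        # E-step: posterior probability that the true label is 1.
        q = p * lik1 / (p * lik1 + (1.0 - p) * lik0)
        # M-step: refit the logistic model to the soft labels.
        w = fit_logistic_soft(X, q, w)
    return w
```

Setting both flip probabilities to zero makes the E-step return the observed labels unchanged, so the same function reproduces the naive fit that ignores the noise. That gives a simple baseline for comparison on simulated data.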
Keywords/Search Tags: CCC-Noise, Classification, Classifier, Semi-supervised