
Research On Classification Methods For Imperfect Data

Posted on: 2022-11-14    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C C Li    Full Text: PDF
GTID: 1488306758479134    Subject: Computer software and theory
Abstract/Summary:
Classification is a core and fundamental research problem in machine learning, with applications including spam detection, disease diagnosis, and credit card fraud detection, to name a few. Existing classification methods focus on inducing a parameterized classifier (e.g., a deep neural network) from labeled data, which produces a label vector given the features of an instance. To guarantee the performance of the learned classifier, these methods require that the features and supervision of the data be sufficient, accurate, and definite. With the rapid development of the Internet, data resources are growing exponentially; however, a huge amount of them are imperfect. In other words, much real-world data consists of sparse, missing, or corrupted features, as well as incomplete, inaccurate, or indefinite supervision. Inducing a classifier from such imperfect data is significantly challenging. In this thesis, we concentrate on imperfect data with sparse features, incomplete supervision, and indefinite supervision, and investigate three important classification problems for such data: short text classification (STC), semi-supervised learning (SSL), and partial label learning (PLL). The main contributions of this thesis are outlined as follows:

1. Short texts are a typical kind of imperfect data with sparse features. Existing bag-of-words based STC methods neglect semantic knowledge and suffer from document similarity misalignment. To remedy this, based on the word mover's distance (WMD) and word embeddings, we propose two modified methods: (1) We propose an RWMD-based centroid classifier for short texts, named RWMD-CC. It employs the regularized WMD (RWMD) to measure semantic distances among short texts and applies the hypothesis margin to learn a representative semantic centroid for each category, so that a new short text is predicted by comparing the WMDs between it and these semantic centroids, making the testing complexity linear in the number of categories. Experimental results show that RWMD-CC performs better in both short text classification and testing efficiency. (2) We propose a Wasserstein topic model, namely Semantics-assisted Wasserstein Learning (SAWL). In SAWL, we formulate an NMF-like objective with a regularized Wasserstein distance loss based on word embeddings, which introduces word semantic correlations into topic modeling, and we integrate a word positive pointwise mutual information (PPMI) matrix factorization to refine the word embeddings for capturing corpus-specific semantics, enabling topics and word embeddings to reinforce each other. We also analyze SAWL and provide dimensionality-dependent generalization bounds on its reconstruction errors. SAWL can be applied to both short and long texts. Experimental results indicate that SAWL performs better in text modeling and classification, as well as in learning word embeddings.

2. SSL focuses on learning a classifier from imperfect data with incomplete supervision. Semi-supervised text classification (SSTC) and positive and unlabeled (PU) learning are two very important SSL problems. Existing SSTC methods based on the deep self-training spirit, as well as PU learning methods, usually suffer from low confidence in pseudo-labeled instances. To remedy this, we propose two novel methods for SSTC and PU learning, respectively: (1) We propose a self-training Semi-Supervised Text Classification method with Balanced Deep representation Distributions (S²TC-BDD). Existing self-training SSTC methods often suffer from low accuracy of pseudo-labels for unlabeled texts because of margin bias, caused by the large differences between the representation distributions of labels in SSTC. To alleviate this problem, we employ the angular margin loss and perform a set of Gaussian linear transformations to keep all label representation distributions balanced; based on this insight, we propose S²TC-BDD. Experimental results show that S²TC-BDD outperforms SSTC baselines in most cases, especially when labeled texts are scarce. (2) We propose a novel PU learning method, namely Positive and unlabeled learning with Partially Positive Mixup (P³Mix), which benefits from data augmentation and supervision correction simultaneously through a heuristic mixup technique. Specifically, we take inspiration from the decision boundary deviation phenomenon observed in our preliminary experiments, in which the learned PU boundary tends to deviate from the fully supervised boundary towards the positive side. To address this, for unlabeled instances with ambiguous predictive scores, we design a heuristic mixup partner selection that transforms them into augmented instances near the PU boundary yet with more precise supervision, so as to push the PU boundary towards the fully supervised boundary and improve classification performance. Experimental results show that P³Mix consistently outperforms all state-of-the-art PU learning baselines.

3. PLL aims to induce a classifier from partial label data, a kind of imperfect data with indefinite supervision. Mainstream PLL methods follow the disambiguation spirit; however, these disambiguation methods consider only local information to constrain the labeling confidences, resulting in potentially less accurate estimations as well as worse classification performance. To address this problem, we propose two novel PLL methods: (1) We propose PANGOLIN, a novel partial label learning method that simultaneously leverages global and local consistencies. In PANGOLIN, we develop a global consistency in the feature space based on category prototypes and manifold regularization, and incorporate a local consistency in the label space, to jointly regularize the labeling confidences. Experimental results indicate that PANGOLIN significantly outperforms existing state-of-the-art PLL baselines. (2) We develop an Adversarial Ambiguous Label Learning with Candidate Instance Detection (A²L²CID) method, which performs effective candidate label disambiguation from a new instance-pivoted perspective using Triple-GAN and a complementary learning paradigm. In A²L²CID, we transform each PLL instance into a set of candidate instances by recombining its features with each of its candidate labels, then employ a discriminator to detect fake candidate instances, and train a classifier without them. We theoretically prove that a global equilibrium point exists in A²L²CID. Experimental results indicate that A²L²CID performs better than state-of-the-art PLL methods, especially on datasets with more candidate labels.
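To make the RWMD-based centroid prediction of contribution 1 concrete, the following is a minimal NumPy sketch. It implements the standard relaxed-WMD lower bound (each word's mass moves to its nearest word in the other document; the two directional costs are maximized) and nearest-centroid prediction, which is linear in the number of categories. The function names, toy embeddings, and uniform word weights are illustrative assumptions; the thesis's hypothesis-margin learning of the centroids themselves is omitted.

```python
import numpy as np

def rwmd(x_emb, x_w, y_emb, y_w):
    """Relaxed WMD lower bound between two documents, each given as
    word embeddings (n, d) and word weights (n,) summing to 1.
    Relaxing one marginal constraint at a time lets every word send
    its full mass to the nearest word of the other document; the
    tighter bound is the max of the two directional costs."""
    # pairwise Euclidean distances between the two vocabularies
    d = np.linalg.norm(x_emb[:, None, :] - y_emb[None, :, :], axis=-1)
    cost_xy = np.sum(x_w * d.min(axis=1))  # x's mass to nearest y words
    cost_yx = np.sum(y_w * d.min(axis=0))  # y's mass to nearest x words
    return max(cost_xy, cost_yx)

def predict(doc_emb, doc_w, centroids):
    """Assign the label of the nearest semantic centroid; testing
    cost is linear in the number of categories."""
    dists = [rwmd(doc_emb, doc_w, c_emb, c_w) for c_emb, c_w in centroids]
    return int(np.argmin(dists))
```

With two toy centroids (one near the origin, one far away), a short text whose word embeddings sit near the origin is assigned to the first category.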
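The supervision-correction idea behind P³Mix in contribution 2 can be sketched as follows. This is a generic mixup interpolation between an ambiguous unlabeled instance and a labeled positive partner, not the thesis's specific heuristic partner-selection rule; the function name, the Beta-distributed coefficient, and the soft-label target are illustrative assumptions.

```python
import numpy as np

def mixup_toward_positive(x_u, score_u, x_p, alpha=0.75, rng=None):
    """Blend an ambiguous unlabeled instance (predictive score near
    0.5) with a labeled positive partner. The mixed instance stays
    near the decision boundary in feature space but carries more
    precise (more positive) supervision, nudging the learned PU
    boundary toward the fully supervised one."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)              # mixup coefficient in (0, 1)
    x_mix = lam * x_u + (1.0 - lam) * x_p     # interpolated features
    y_mix = lam * score_u + (1.0 - lam) * 1.0 # partner's label is 1 (positive)
    return x_mix, y_mix
```

Because the coefficient lies in (0, 1), the mixed soft label always falls between the unlabeled instance's ambiguous score and the positive label 1.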
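The deep self-training spirit that S²TC-BDD builds on follows a train / pseudo-label / retrain loop, sketched below. As an assumption, a nearest-class-mean scorer stands in for the neural text classifier, and a fixed confidence threshold selects pseudo-labels; the balanced-distribution machinery (angular margin loss, Gaussian linear transformations) that is the thesis's actual contribution is omitted.

```python
import numpy as np

def self_train(X_l, y_l, X_u, threshold=0.9, rounds=3):
    """Generic self-training skeleton for binary classification.
    Each round: fit class means on the labeled pool, score unlabeled
    instances by distance, and absorb those whose pseudo-label
    confidence clears the threshold."""
    X_l, y_l = X_l.copy(), y_l.copy()
    for _ in range(rounds):
        means = np.stack([X_l[y_l == c].mean(axis=0) for c in (0, 1)])
        d = np.linalg.norm(X_u[:, None, :] - means[None, :, :], axis=-1)
        # softmax over negative distances -> pseudo-label confidence
        p = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
        conf, pseudo = p.max(axis=1), p.argmax(axis=1)
        keep = conf >= threshold
        if not keep.any():
            break
        X_l = np.vstack([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, pseudo[keep]])
        X_u = X_u[~keep]
    return X_l, y_l
```

On well-separated toy clusters, all unlabeled points are absorbed with the labels of their nearest cluster; margin bias arises precisely when the classes are not this well separated, which is what S²TC-BDD addresses.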
Keywords/Search Tags: Classification Methods, Imperfect Data, Short Text Classification, Semi-Supervised Learning, Partial Label Learning