
Research On Weakly-supervised Classification Methods Based On Samples And Labels Modeling

Posted on: 2020-01-15    Degree: Master    Type: Thesis
Country: China    Candidate: X Chen    Full Text: PDF
GTID: 2428330599956775    Subject: Computer application technology
Abstract/Summary:
Supervised classification techniques are built on the assumption of strong supervision: models are trained on a large number of samples, each carrying a single, unambiguous ground-truth label. Although existing supervised classification techniques have achieved great success, data annotation requires substantial manpower and resources, and annotations are affected by many factors such as the external environment, the characteristics of the problem, and the annotators themselves. As a result, many datasets come with labels that are scarce or inaccurate. In addition, objects in the real world can be polysemous; that is, each sample can be associated with multiple labels. In this polysemous scenario, the exponential-scale output space demands even more supervision from the learning system. Under the weakly-supervised scenarios of insufficient supervision (i.e., too few labeled samples), inaccurate supervision (i.e., noisy labels), and polysemous supervision (i.e., each sample may carry multiple labels), the traditional supervised classification framework struggles to achieve good performance. It is therefore of great significance to study classification methods for weakly-supervised scenarios. In this thesis, we study these three weakly-supervised scenarios through semi-supervised learning, multi-label active learning, and partial multi-label learning. The main contributions are as follows:

1) Solving insufficient supervision with semi-supervised learning: Samples in the real world are not always evenly distributed. Samples from different classes that lie close to the decision boundary may also lie close to each other, and are therefore easy to misclassify. To mitigate this issue, we propose an approach called Semi-Supervised Classification based on Clustering Adjusted Similarity (SSC-CAS). SSC-CAS first applies a clustering algorithm to both labeled and unlabeled samples to explore their structure, and then adjusts the similarity between each pair of samples by multiplying it by the similarity between the centers of the clusters they belong to. In this way, the similarity between two samples from different clusters is reduced, while the similarity between samples from the same cluster is left unchanged. SSC-CAS then performs semi-supervised classification on the resulting similarity graph. An empirical study demonstrates the effectiveness of the proposed graph construction strategy and shows that SSC-CAS outperforms related competing methods.

In addition, most existing semi-supervised classification methods treat every sample as equally important. In fact, samples close to the decision boundary between classes are generally more important than samples far away from it. To account for this, we propose an approach called Weighted-Samples based Semi-Supervised Classification (WS3C). WS3C first runs multiple clusterings over all samples to quantify a hard-to-cluster index for each sample and to measure pairwise similarity; samples closer to the decision boundary are harder to cluster consistently. WS3C then weights the samples by their hard-to-cluster index, and combines the weighted samples with the pairwise similarities in a manifold-learning-based regularization framework to predict the labels of unlabeled samples. An empirical study demonstrates that assigning different weights to samples significantly improves accuracy over treating all samples equally, and that WS3C outperforms related competing methods.

2) Solving insufficient supervision under polysemous supervision with multi-label active learning: In the polysemous scenario, labeling samples is more difficult and more costly. Whether a particular label is relevant to a sample depends on the characteristics of the sample itself, yet current active learning methods require scrutinizing the whole sample to obtain its labels. In contrast, one can often find positive evidence for a sample-label pair by examining specific patterns (i.e., subsamples) of the sample rather than the whole sample, making annotation cheaper. Based on this observation, we propose a novel Cost-effective Multi-label Active Learning framework, called CMAL. CMAL first introduces a novel sample-label pair selection strategy that picks the most valuable pairs by combining uncertainty, label correlation, and label-space sparsity. CMAL then iteratively queries the most plausible positive subsample-label pairs of the selected sample-label pairs. Comprehensive experiments demonstrate that CMAL achieves better classification performance than competing methods under the same cost budget.

3) Solving inaccurate supervision under polysemous supervision with partial multi-label learning: In the polysemous scenario, annotation is more difficult, so noisy labels are more likely to be collected, and noisy labels clearly degrade classification performance. However, current multi-label learning methods assume that the obtained labels are noise-free, and there is still very little work on inaccurate supervision under polysemous supervision. In this thesis, we introduce a method called Matrix Factorization for Identifying Noisy Labels of multi-label instances (MF-INL) to identify noisy labels of multi-label samples. MF-INL first decomposes the original sample-label association matrix into two low-rank matrices via matrix factorization, using feature-based and label-based constraints to retain the geometric structure of samples and the label correlations in the low-dimensional space. MF-INL then reconstructs the association matrix from the product of the decomposed matrices, and flags the associations with the lowest reconstructed values as noisy. An empirical study shows that MF-INL identifies noisy labels more accurately than related solutions.

To further improve noisy-label identification and jointly train classification models, we introduce a Feature-induced Partial Multi-label Learning approach called fPML. fPML simultaneously factorizes the observed sample-label association matrix and the sample-feature matrix into a coherent low-dimensional space to learn a low-rank approximation of the sample-label association matrix, which is then used to estimate association confidences. To predict the labels of unlabeled samples, fPML learns a mapping from samples to labels based on the estimated confidences. An empirical study shows that fPML identifies noisy labels more accurately than related solutions, and consequently predicts the labels of unlabeled samples better than competing methods.
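The clustering-adjusted similarity construction of SSC-CAS can be sketched as follows. The RBF similarity between cluster centers and the use of a precomputed cluster assignment are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np

def clustering_adjusted_similarity(S, X, cluster_labels):
    """Shrink inter-cluster similarities by the similarity of cluster centers.

    S: (n, n) pairwise sample similarity; X: (n, d) features;
    cluster_labels: (n,) assignments from any clustering of all samples.
    """
    _, inv = np.unique(cluster_labels, return_inverse=True)
    centers = np.array([X[inv == k].mean(axis=0) for k in range(inv.max() + 1)])
    # RBF similarity between cluster centers, values in (0, 1]
    d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    C = np.exp(-d2)
    same = inv[:, None] == inv[None, :]
    # same cluster: similarity unchanged; different clusters: scaled down
    return np.where(same, S, S * C[np.ix_(inv, inv)])
```

The adjusted matrix would then serve as the graph over which semi-supervised label propagation runs.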
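The hard-to-cluster index of WS3C can be illustrated by running several randomly initialized k-means clusterings and measuring how unstably each sample co-clusters with the others. The concrete instability measure (co-association variance) below is an assumption for illustration:

```python
import numpy as np

def kmeans_labels(X, k, rng, iters=20):
    """Plain Lloyd's k-means with random initial centers (illustrative)."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def hard_to_cluster_index(X, k=2, runs=10, seed=0):
    """Samples whose co-memberships flip across clustering runs score high."""
    rng = np.random.default_rng(seed)
    n = len(X)
    co = np.zeros((n, n))
    for _ in range(runs):
        lab = kmeans_labels(X, k, rng)
        co += (lab[:, None] == lab[None, :]).astype(float)
    co /= runs
    # a pair co-clustered in about half of the runs is maximally unstable
    instability = 4.0 * co * (1.0 - co)
    return instability.mean(axis=1)
```

Samples near a decision boundary flip cluster membership across runs and receive a high index, which WS3C would then use as a sample weight in its regularization framework.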
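CMAL's sample-label pair selection can be sketched by scoring each unqueried pair with a weighted sum of prediction uncertainty, a correlation-based boost, and label-space sparsity. The particular score combination and the weights `beta` and `gamma` are illustrative assumptions, not the exact CMAL criterion:

```python
import numpy as np

def select_sample_label_pairs(P, known_mask, n_query, beta=0.1, gamma=0.1):
    """P: (n, L) predicted positive probabilities;
    known_mask: (n, L) boolean mask of already-queried pairs."""
    uncertainty = 1.0 - 2.0 * np.abs(P - 0.5)         # peaks at P = 0.5
    # label correlation: labels co-predicted with confident positives get a boost
    C = np.nan_to_num(np.corrcoef(P, rowvar=False))
    np.fill_diagonal(C, 0.0)
    corr_boost = (P @ np.clip(C, 0, None)) / max(P.shape[1] - 1, 1)
    # label-space sparsity: rarely-positive labels are prioritized
    sparsity = 1.0 - P.mean(axis=0)
    score = uncertainty + beta * corr_boost + gamma * sparsity[None, :]
    score[known_mask] = -np.inf                       # never re-query known pairs
    top = np.argsort(score, axis=None)[::-1][:n_query]
    return [tuple(np.unravel_index(i, P.shape)) for i in top]
```

The selected pairs would then be passed to the subsample-level querying step described above.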
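The identification step of MF-INL can be illustrated with a plain truncated SVD standing in for the thesis's constrained matrix factorization (the feature-based and label-based constraints are omitted here): reconstruct the association matrix at low rank and flag the observed associations with the lowest reconstructed values.

```python
import numpy as np

def identify_noisy_labels(Y, rank, n_flag):
    """Flag the n_flag observed sample-label associations whose low-rank
    reconstruction value is smallest (most likely noisy)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    Yhat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    observed = np.argwhere(Y == 1)            # row-major order
    scores = Yhat[Y == 1]                     # matches the order of `observed`
    worst = np.argsort(scores)[:n_flag]
    return [tuple(observed[i]) for i in worst]
```

An association that breaks the low-rank structure (e.g., a label injected into the wrong block) reconstructs with a markedly lower value than the clean associations and is flagged first.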
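fPML's joint factorization of the sample-label and sample-feature matrices through a shared low-dimensional sample representation can be sketched with alternating ridge updates. The update rule, regularization weight, and rank below are illustrative assumptions rather than the thesis's exact optimization:

```python
import numpy as np

def fpml_factorize(Y, X, rank=2, lam=0.1, iters=50, seed=0):
    """Jointly factorize labels Y ~ U @ V and features X ~ U @ W with a
    shared sample representation U, via alternating ridge updates."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    U = rng.standard_normal((n, rank))
    R = lam * np.eye(rank)
    for _ in range(iters):
        V = np.linalg.solve(U.T @ U + R, U.T @ Y)    # label factor   (rank, L)
        W = np.linalg.solve(U.T @ U + R, U.T @ X)    # feature factor (rank, d)
        U = np.linalg.solve(V @ V.T + W @ W.T + R,
                            V @ Y.T + W @ X.T).T     # shared factor  (n, rank)
    return U, V, W
```

The product U @ V then serves as the association-confidence estimate; a sample-to-label mapping fit against these confidences would label unseen samples, as the abstract describes.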
Keywords/Search Tags: Weakly-supervised classification, Semi-supervised learning, Multi-label active learning, Partial multi-label learning