
Research on Label Noise Cleaning Based on Active Learning

Posted on: 2021-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: X C Meng
GTID: 2428330620963144
Subject: Computer technology
Abstract/Summary:
With the rapid development and widespread use of the Internet and mobile Internet, the scale of data that people can obtain keeps growing, and extracting valuable information from that data is becoming more and more important. As an important technology for data mining and analysis, machine learning aims to mine key information from data and to predict unknown information from existing information, so as to provide better decision-making support. Supervised learning is one of the main learning paradigms in machine learning, and labels are its key ingredient, playing a crucial role in model training. In the real world, data may contain a certain degree of label noise because of subjective limitations in the labeling process, such as limited professional knowledge and manual labeling errors, and this noise may have a severe negative impact on the model. It is therefore of great significance to improve the label quality of training data for supervised learning.

At present, most approaches to label noise directly filter out samples once they are recognized as noise. Although such methods are simple, the information carried by the discarded samples is lost, especially when the number of noisy samples is large. To address the problem that label noise filtering may discard too many samples, this thesis studies label noise recognition and processing for classification problems using active learning. The main contents are summarized as follows:

(1) An Active Label Noise Cleaning (ALNC) algorithm is proposed to address the loss of data information that occurs when many noisy samples are removed. Through active learning, the algorithm repeatedly selects the most uncertain samples from the existing labeled samples and submits them to human experts for inspection and relabeling. The relabeled samples, which can be regarded as clean, are then put back into the original data set. Through this iterative procedure, a high utilization rate of the original data can be maintained while most noisy samples are cleaned, and the noise recognition performance can be better than that of traditional noise filtering methods.

(2) An active label noise cleaning algorithm based on Sample Set Partitioning based on Joint X-Y Distance (SPXY) sampling is proposed. Although ALNC recognizes noise well and maintains the integrity of the original data, it still incurs a high additional manual labeling cost, because the selected suspected-noise samples may include a certain proportion of normal samples. To reduce the additional manual inspection cost of label noise cleaning, the Active Label Noise Cleaning based on SPXY sampling (SPXY_ALNC) algorithm is proposed on the basis of ALNC. It considers both the uncertainty and the representativeness of samples, and can thereby significantly reduce the additional manual inspection cost while keeping the noise recognition performance unchanged.

In summary, to address the poor noise recognition performance and low data utilization rate of traditional noise filtering methods, this thesis proposes a label noise cleaning method based on active learning. A subset of samples containing as many noisy samples as possible is selected from the training set through active learning. The algorithm is then further improved by raising the proportion of noisy samples among the selected samples, which can lead to better noise cleaning. The research results may have some significance and application value for improving data quality.
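
The iterative loop described in item (1) of the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not say which classifier or uncertainty measure ALNC uses, so the sketch assumes a simple stand-in score (the fraction of a sample's k nearest neighbours whose label disagrees with its own), and the names `label_disagreement`, `alnc`, and `oracle` are hypothetical. The oracle plays the role of the human expert and here simply returns the true label of a toy data set.

```python
import random

def label_disagreement(X, y, i, k=5):
    """Assumed uncertainty score: share of the k nearest neighbours of
    sample i whose current label differs from y[i]."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
        for j in range(len(X)) if j != i
    )
    neighbours = [j for _, j in dists[:k]]
    return sum(y[j] != y[i] for j in neighbours) / len(neighbours)

def alnc(X, y_noisy, oracle, batch_size=5, rounds=4, k=5):
    """Each round, send the most suspicious unverified samples to the
    oracle, then put them back into the data set with the verified label.
    No sample is ever discarded."""
    y = list(y_noisy)
    verified = set()
    for _ in range(rounds):
        candidates = [i for i in range(len(X)) if i not in verified]
        # Most suspicious first; scores use the labels as cleaned so far.
        candidates.sort(key=lambda i: label_disagreement(X, y, i, k), reverse=True)
        for i in candidates[:batch_size]:
            y[i] = oracle(i)   # expert inspection and relabeling
            verified.add(i)    # treated as clean from now on
    return y, verified

# Toy data: two 2-D Gaussian blobs, with roughly 15% of the labels flipped.
rng = random.Random(0)
X = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0)) for c in (0.0, 3.0) for _ in range(30)]
y_true = [0] * 30 + [1] * 30
y_noisy = [1 - t if rng.random() < 0.15 else t for t in y_true]

y_clean, verified = alnc(X, y_noisy, oracle=lambda i: y_true[i])
errors_before = sum(a != b for a, b in zip(y_noisy, y_true))
errors_after = sum(a != b for a, b in zip(y_clean, y_true))
print(errors_before, errors_after, len(verified))
```

Because every inspected sample is returned to the data set, the cleaned label list has the same length as the input, which is the contrast with filtering that the abstract emphasises. The number of oracle calls (`batch_size * rounds`) corresponds to the additional manual labeling cost that SPXY_ALNC is said to reduce.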
Keywords/Search Tags: label noise, noise cleaning, active learning, SPXY sampling