The Classification of Less-Labeled Imbalanced Data Based on Active Learning

Posted on: 2023-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: Z D Zhao
GTID: 2568306836973849
Subject: Computer technology
Abstract/Summary:
Classification algorithms are a fundamental part of data mining. Traditional classification methods tend to optimize overall classification accuracy, and when applied to imbalanced data they usually improve overall accuracy at the expense of accuracy on the minority classes. In many practical applications the minority class samples of an imbalanced dataset are the crucial ones, so classification algorithms for imbalanced data have received extensive attention from researchers. This thesis first proposes an initial sample selection strategy for the binary classification of imbalanced data that focuses on the minority class and reduces the overhead of subsequent iterations, and extends the strategy to linearly inseparable datasets. Then, within a traditional support vector machine based active learning strategy, the cosine similarity of the majority class samples to be labeled is calculated and the sample with the smallest value is selected; the selected majority class samples, together with the minority class samples, form a balanced set of samples to be labeled. Experiments on the mushroom and Reuters-21578 datasets verify the feasibility of the strategy, and the results show that it effectively reduces active learning iteration time. Because manual annotation in active learning requires the participation of experts and is costly, the thesis combines semi-supervised learning with active learning and proposes an active learning method that reduces sample redundancy. The method selects samples that are both highly informative and highly representative, which avoids invalid sampling during active learning. In addition, the semi-supervised learning strategy based on the transductive support vector machine (TSVM) is improved: eliminating majority class samples in batches effectively reduces the time spent in semi-supervised learning. Finally, the two strategies are integrated and experiments are conducted on the digits and MNIST datasets; the results show that the combined strategy effectively reduces algorithm execution time while maintaining classification accuracy.
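
The balanced-batch step described in the abstract can be illustrated with a minimal sketch. The abstract does not say what the cosine similarity of a majority class candidate is measured against, so the code below rests on an assumption: each candidate is compared with the majority class candidates already chosen, and the least similar one is picked next, until the number of chosen majority samples equals the number of minority candidates. The function names `cosine_similarity_matrix` and `select_balanced_batch` are hypothetical, and this is not presented as the thesis's actual implementation.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    # Guard against zero-length vectors with a small lower bound on the norm.
    a_norm = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_norm = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a_norm @ b_norm.T


def select_balanced_batch(majority, minority):
    """Return indices of majority candidates that balance the minority candidates.

    Assumption (not stated in the abstract): similarity is measured against
    the majority candidates already chosen, so near-duplicates are skipped.
    """
    n_pick = min(len(majority), len(minority))
    if n_pick == 0:
        return []
    sim = cosine_similarity_matrix(majority, majority)
    # Seed with the candidate that is least similar to the pool on average.
    chosen = [int(np.argmin(sim.mean(axis=1)))]
    while len(chosen) < n_pick:
        # For every candidate, its highest similarity to any chosen sample.
        max_sim = sim[:, chosen].max(axis=1)
        max_sim[chosen] = np.inf  # never pick the same sample twice
        chosen.append(int(np.argmin(max_sim)))
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    majority_pool = rng.normal(size=(50, 8))  # unlabeled majority candidates
    minority_pool = rng.normal(size=(5, 8))   # unlabeled minority candidates
    picked = select_balanced_batch(majority_pool, minority_pool)
    print("majority indices sent for labeling:", picked)
```

In an active learning loop the two pools would come from the current SVM's predictions on the unlabeled samples nearest the decision boundary; that step is omitted here to keep the sketch self-contained.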
Keywords: Active learning, Unbalanced data, Semi-supervised learning, Support vector machines