With the rapid development of the Internet in China, people’s lives have become more convenient. At the same time, the proliferation of illegal data such as adult images, gun images, and violent images has become increasingly serious. Such content severely harms the physical and mental health of adults and adolescents and may even lead to crime. Recognizing this hazard, the relevant departments have attempted to solve the problem through manual review or machine review. However, because of the large volume of content on the Internet, screening is cumbersome and highly subjective, and traditional methods cannot keep up with the growing amount of data. Moreover, because illegal information poses potential risks to the physical and mental health of Internet users and affects social stability, the relevant national departments have explicitly prohibited its spread. There is therefore an urgent need to study effective methods for identifying illegal data that are adapted to the current environment. This thesis studies deep learning algorithms for identifying illegal image data and proposes two algorithms: an identification algorithm for class-imbalanced illegal image data and a pseudo-label-based semi-supervised identification algorithm for illegal image data. The main research work of this thesis includes:

1. To address the class imbalance problem in non-compliant image datasets, this chapter aims to mitigate the impact of class imbalance on the model and proposes an identification algorithm for class-imbalanced non-compliant image data. The algorithm employs data augmentation to expand the samples of minority classes; designs multi-path residual modules that enable the model to extract richer feature information from non-compliant images; and proposes an anti-feature-overlap cross-entropy loss function that assigns different weights to samples of imbalanced classes. Ablation experiments demonstrate the effectiveness of each proposed improvement. Comparative experiments show that the algorithm outperforms other algorithms on various evaluation metrics on multi-class imbalanced non-compliant image datasets.

2. To address the difficulty of manually labeling large amounts of non-compliant image data, this chapter focuses on identifying non-compliant image data in semi-supervised scenarios and proposes a pseudo-label-based semi-supervised identification algorithm for non-compliant image data. The algorithm constructs an ensemble learning model consisting of three classifiers, which is initially trained on a labeled training set combined with data augmentation. A joint decision algorithm is designed to resolve disagreements among the three classifiers, assign pseudo-labels and weights to unlabeled samples, and screen out highly reliable samples. A deep residual model incorporating attention mechanisms and width learning is then designed and trained on both the labeled data and the pseudo-labeled data. Ablation experiments demonstrate the effectiveness of the proposed improvements, and comparative experiments show that the algorithm’s performance is superior to that of other algorithms.
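
The two ideas summarized above, weighting samples of imbalanced classes in the loss and jointly deciding pseudo-labels from three classifiers, can be illustrated with a minimal sketch. This is not the thesis's method: the anti-feature-overlap loss and the joint decision algorithm are not specified in this abstract, so the sketch substitutes plain inverse-frequency class weighting and a majority vote with a mean-confidence weight. All function names and the `min_weight` threshold are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed stand-in weighting: weight of class c is n / (k * count_c),
    so rarer classes receive larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted cross-entropy: weighted mean of -log p(true class).
    probs[i][c] is the predicted probability of class c for sample i."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


def joint_pseudo_label(prob_a, prob_b, prob_c, min_weight=0.8):
    """Hypothetical joint decision for one unlabeled sample.

    Each argument is one classifier's probability vector. The pseudo-label
    is the majority vote; its weight is the mean probability the three
    classifiers give that label. Returns (label, weight), or None when no
    two classifiers agree or the weight falls below min_weight.
    """
    outputs = (prob_a, prob_b, prob_c)
    votes = [max(range(len(p)), key=p.__getitem__) for p in outputs]
    label, n_votes = Counter(votes).most_common(1)[0]
    if n_votes < 2:  # all three classifiers disagree
        return None
    weight = sum(p[label] for p in outputs) / 3
    return (label, weight) if weight >= min_weight else None
```

In this sketch, samples for which `joint_pseudo_label` returns `None` would simply stay unlabeled, while accepted samples would join the training set with their weight scaling their contribution to the loss.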