The rapid development of new-generation information technology and its wide application in various fields have triggered explosive growth in data. Exploring the information hidden in this vast amount of data and exploiting its value is an essential task at this stage. As an important task in data mining, data classification has significant research value. However, in many practical problems, class imbalance is inevitable in both structured and unstructured data, which makes data classification difficult. Recently, imbalanced classification algorithms have made a series of significant advances in theory, methods, and applications. However, they still face challenges such as class overlap and intra-class imbalance, insufficient minority class representation capability, and a lack of supervisory information. Focusing on these challenges, this thesis presents new research on classification algorithms for imbalanced data. The main research results are as follows:

(1) To address the problem of class overlap and intra-class imbalance in traditional imbalanced data classification, we propose an adaptive undersampling-based imbalanced data classification algorithm. First, a nearest neighbor search is used to identify the majority class samples in the overlap area, and these samples are removed. Then, an improved density peak clustering algorithm is applied to automatically obtain multiple sub-clusters with different shapes, sizes, and densities. Finally, sampling weights are calculated according to the densities of the samples in the sub-clusters, and undersampling is performed according to these weights. The classifiers trained on the resulting balanced datasets are combined by bagging. Experiments show that the proposed adaptive undersampling method based on density peak clustering significantly improves the performance of imbalanced data classification compared with existing undersampling methods.

(2) To address the problem of insufficient minority
class representation in imbalanced node classification, we propose a hybrid sampling-based graph contrastive learning algorithm for imbalanced node classification. The core of this algorithm is to balance the negative sample set using hybrid sampling, so that the different classes are equally represented among the negative samples. This enhances the representation of minority class nodes and thus improves the performance of imbalanced node classification. Extensive experimental results show that the method improves classification performance compared with standard graph contrastive learning, and that it outperforms other state-of-the-art imbalanced node classification methods.

(3) To address the lack of supervisory information in imbalanced node classification, we propose a self-supervised learning-based algorithm for imbalanced node classification. Self-supervised learning serves two purposes in this algorithm: it expands the available supervisory information, and it enhances the expressive ability of the node representations. In addition, a semantic constraint loss is designed alongside the cross-entropy loss and the self-supervised contrastive loss to ensure semantic consistency in graph data augmentation. Experimental results on real graph datasets show that the proposed algorithm obtains discriminative representations that are more effective for the imbalanced node classification task.

In conclusion, this thesis proposes a series of imbalanced classification algorithms based on adaptive undersampling, graph contrastive learning, and self-supervised learning to address the challenges faced by imbalanced classification algorithms: class overlap and intra-class imbalance, insufficient representation ability of minority classes, and a lack of supervisory information. These algorithms provide new methods and ideas for imbalanced data classification, and the research results have theoretical significance and application value for the analysis and mining of imbalanced data.
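
As an illustration of the adaptive undersampling idea in result (1), the following is a minimal sketch in plain Python, not the thesis implementation. It rests on several assumptions that the abstract does not confirm: a majority sample is treated as lying in the overlap area if any of its k nearest neighbors is a minority sample; the improved density peak clustering step is omitted, so all remaining majority samples are treated as a single cluster; the local density is the cutoff-distance neighbor count used in standard density peak clustering; and the sampling weight is taken to be proportional to that density. The bagging step is also left out. All function names are hypothetical.

```python
import math
import random


def remove_overlap(majority, minority, k=3):
    """Drop majority samples that have a minority sample among their k nearest neighbours.

    Assumption: this "any minority neighbour" rule stands in for the
    overlap criterion, which the abstract does not specify.
    """
    pool = majority + minority
    is_minority = [False] * len(majority) + [True] * len(minority)
    kept = []
    for i, x in enumerate(majority):
        # Brute-force nearest neighbour search, excluding the sample itself.
        others = [j for j in range(len(pool)) if j != i]
        others.sort(key=lambda j: math.dist(x, pool[j]))
        if not any(is_minority[j] for j in others[:k]):
            kept.append(x)
    return kept


def local_density(points, cutoff):
    """Count the points within `cutoff` of each point (the point itself included)."""
    return [sum(1 for q in points if math.dist(p, q) <= cutoff) for p in points]


def density_weighted_undersample(points, n_keep, cutoff, rng):
    """Sample n_keep points without replacement, weighted by local density.

    Assumption: weight proportional to density. The abstract only says
    that weights are computed "according to" the densities.
    """
    weights = local_density(points, cutoff)  # every weight is >= 1
    # Efraimidis-Spirakis weighted sampling: keep the largest keys u ** (1 / w).
    keys = [rng.random() ** (1.0 / w) for w in weights]
    ranked = sorted(range(len(points)), key=lambda i: keys[i], reverse=True)
    return [points[i] for i in ranked[:n_keep]]


# Toy data: a dense majority cluster, one isolated majority point,
# and one majority point sitting next to the minority class.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (-2.0, -2.0), (0.95, 1.0)]
minority = [(1.0, 1.0), (1.1, 1.0)]

clean = remove_overlap(majority, minority, k=2)
balanced_majority = density_weighted_undersample(
    clean, n_keep=len(minority), cutoff=0.5, rng=random.Random(0)
)
print(len(clean), len(balanced_majority))
```

In this toy example the majority point at (0.95, 1.0) is removed as an overlap sample, and the remaining majority samples are reduced to the size of the minority class. Repeating the sampling step with different random seeds would give the multiple balanced datasets on which an ensemble could be trained.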