Research On Class Imbalance Classification Algorithm For Stream Data

Posted on: 2024-02-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y W Wu
GTID: 1528306944956549
Subject: Information and Communication Engineering
Abstract/Summary:
Imbalanced classification problems are common in both theoretical research and practical applications across many fields. Imbalanced data can severely degrade the prediction performance of machine learning and deep learning models. The imbalanced classification problem studied in this dissertation involves three dimensions of imbalance. First, the number of samples is imbalanced across classes: one class, generally called the majority class, has far more samples than the other, generally called the minority class. Under such class imbalance, a model can score well on overall evaluation metrics simply by predicting every sample as the majority class, which renders it ineffective. Second, classification difficulty is imbalanced: the vast majority of samples are easy to classify (easy samples), while a small proportion lie near the classification threshold and are difficult to classify (hard samples). In practical applications, the model is expected to classify hard samples accurately. Because easy samples form an absolute majority and dominate overall metrics and loss values, the model tends to neglect hard samples and so loses its practical predictive value. Third, the features are imbalanced: minority-class samples have few decisive features, and only a few features make a large positive contribution to minority-class predictions, while most features contribute little, consume computing resources, and dilute the predictive effect of the important features. These three dimensions of imbalance appear in many problems, such as network security
detection, medical image recognition, e-commerce product recommendation, news recommendation, and traffic condition prediction. Addressing these three dimensions of imbalance can significantly improve performance on imbalanced classification problems.

Streaming data arrives continuously over time and has strong temporal correlation. In the analysis of streaming data, a high degree of imbalance, combined with wide coverage, large total data volume, rich features, and low operability, greatly reduces the effectiveness of general classification methods. This dissertation therefore focuses on imbalanced classification algorithms for streaming data and studies three categories of it. First, resource streaming data is the stream of records generated when users access Uniform Resource Locator (URL) information; it contains the content of the resources accessed by the user. Second, feature streaming data describes the preferences, attributes, or action characteristics of users, products, news, videos, and similar objects over time, and is generated when users visit, browse, or click on such objects. It contains the temporal action records of the described objects and the object characteristics obtained from recorded statistics. Third, motion streaming data is the parameter stream generated during the movement of a moving body (a person, car, aircraft, etc.); it includes a series of motion states and characteristics such as speed, acceleration, and physical performance, and is so named because it records kinematic information.

Different categories of streaming data involve different dimensions of imbalanced classification. For resource and motion streaming data, the main problem is severe imbalance in the number of samples per class. This
dissertation proposes supervised and unsupervised classification methods to address the class imbalance problem in both categories. For feature streaming data, the imbalance lies not only in the number of samples per class but also in sample classification difficulty and in the features. This dissertation proposes a comprehensive solution for the three dimensions of imbalance in feature streaming data and cross-validates the effectiveness of the proposed supervised imbalanced classification algorithm. It studies class imbalance classification in security detection, recommendation systems, and vehicle prediction, which correspond to resource, feature, and motion streaming data, respectively. The loss functions proposed in this dissertation can be used interchangeably across supervised classification methods. Chapter 3 also performs fusion testing and analysis of the algorithms proposed in Chapter 2. The main research content and innovations of this dissertation are as follows:

1. Cost-sensitive Siamese imbalance classification algorithm based on resource stream data

This research focuses on the User Identity Linkage (UIL) problem in resource stream data. It aims to improve the accuracy of associating natural persons with virtual accounts and browsing behavior, laying a foundation for addressing network security and child user detection. The UIL problem is first abstracted and expressed mathematically, and the network optimization goal is clearly defined. Then, from the perspective of network structure, a Siamese Architecture based User Identity Linkage (SAUIL) model with Bidirectional Long Short Term Memory (BLSTM) subnets is proposed; it compares the distance between the subnet outputs to determine whether two groups of access information were generated by the same natural person. In addition, considering the imbalance of sample
category numbers, a Mean Distance False Error (MDFE) loss function is proposed to balance the loss contributions of classes with different sample sizes and to regulate the network's precision and recall. Finally, three real datasets with different ranges, sizes, and sources are used for verification. After tuning by grid search, network performance is compared using the controlled-variable method, and the effectiveness and stability of the loss function are also verified. The experimental results show that SAUIL and MDFE significantly improve performance over the baseline algorithm, and that the loss function remains stable across multiple parameter settings and training runs.

2. Multi-expert attention self-adjusting imbalanced classification algorithm based on feature stream data

This research focuses on optimizing recommendation algorithms for feature stream data. Improving their prediction accuracy enhances the efficiency and user experience of e-commerce, news, video, and other Internet platforms, benefiting governments, enterprises, and users. A Loss Aware Feature Attention Mechanism Network (LAFAMN) is proposed to address the multidimensional class imbalance of feature stream data. The model uses an attention mechanism to weight different input feature groups and self-adjusts the weights based on the prediction results of each subnetwork, which addresses the feature imbalance. As a complement, a dual-suppression loss function based on classification confidence control (Suppression Loss) is proposed. It suppresses the loss contribution of easy samples by raising the loss values to a power, and suppresses the loss contribution of majority-class samples and easy samples through function nesting, while ensuring that the loss values of hard minority-class samples are not reduced, thus simultaneously addressing the class
imbalance problem in the two dimensions of classification difficulty and sample quantity. Used together, LAFAMN and the dual-suppression loss function form a complete recommendation system that addresses class imbalance classification in all three dimensions. Finally, controlled-variable and ablation experiments on two real datasets verify the effectiveness of LAFAMN and the dual-suppression loss function on class imbalance classification, as well as the training stability of the dual-suppression loss function.

3. Imbalanced classification algorithm of multi-clustering fusion architecture based on motion stream data

This research focuses on predicting vehicle working conditions from motion stream data; the algorithm mainly targets the unsupervised classification of kinematic segments. Improving the accuracy of kinematic segment classification can support vehicle working condition research, aircraft operation analysis, athlete competition analysis, and similar tasks. A multi-channel fusion architecture, the Multi-Clustering Stacking Algorithm (MCSA), is proposed to address the imbalance of sample categories in motion stream data. The algorithm splits the dataset of the large sample categories into multiple sub-datasets with low imbalance, learns each sub-dataset with two different algorithms, and obtains a strong feature dataset through weighted correlation sorting. The final result is obtained by clustering the strong feature dataset again. In addition, t-distributed Stochastic Neighbor Embedding (t-SNE) replaces the Principal Component Analysis (PCA) used for dimensionality reduction in traditional vehicle working condition graph construction, and the experimental results show that t-SNE is superior to PCA. This dissertation provides a complete working condition graph construction process from an engineering perspective, as well as the operation process of the MCSA network. Since the
classification problem of kinematic segments is unsupervised, this dissertation conducts a first round of validation through visualization. The second round of validation uses different classification methods to construct working condition graphs for the same working conditions and compares them numerically, with the error rate as the criterion. Experiments on three real datasets from Fujian Province, China show that the MCSA network classifies well, with minimal overlap between categories and tight intra-category cohesion. The working condition graph established by MCSA has a much lower error rate than the baseline working condition graph based on actual statistical data.
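
To make the first dimension of imbalance concrete, the following minimal Python sketch contrasts a plain mean error with a class-averaged error on toy data. It only illustrates the general idea of balancing loss values across classes of different sizes; it is not the MDFE definition from the dissertation, whose exact form is not given in this abstract. The function names, the squared-error choice, and the 95/5 toy split are all assumptions made for illustration.

```python
def plain_mean_error(y_true, y_pred):
    """Average squared error over all samples, regardless of class."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def class_averaged_error(y_true, y_pred):
    """Sum of per-class mean squared errors, so each class weighs equally.

    Illustrative assumption: not the dissertation's MDFE formula.
    """
    total = 0.0
    for label in sorted(set(y_true)):
        errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred) if t == label]
        total += sum(errors) / len(errors)
    return total


# Hypothetical toy data: 95 majority-class (0) and 5 minority-class (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate model that predicts the majority class for every sample.
y_pred = [0.0] * 100

accuracy = sum(1 for t, p in zip(y_true, y_pred) if round(p) == t) / len(y_true)

print("accuracy:", accuracy)
print("plain mean error:", plain_mean_error(y_true, y_pred))
print("class-averaged error:", class_averaged_error(y_true, y_pred))
```

On this toy data the always-majority model reaches 95% accuracy and a plain mean error of 0.05, yet its class-averaged error is 1.0, because every minority-class sample is misclassified and the minority class now carries the same weight as the majority class.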
Keywords/Search Tags: Class Imbalance Data, Loss Function, Cost-Sensitive Algorithm, Classification Algorithm, Recommendation Algorithm