| The rapid development of information technology and intelligent systems has produced large volumes of data, and analyzing and processing these data to extract information and knowledge has become an urgent task. The data for different research objects have their own characteristics, as in fraud detection in finance, disease detection in medicine, and accident severity analysis in transportation. Such data are naturally imbalanced: normal samples account for most of the data, while abnormal samples are scarce. However, the abnormal samples, although a minority, are usually the data of greater interest. Data with this pattern are called imbalanced data. Imbalanced data pose a challenge for classification with machine learning methods. The Synthetic Minority Oversampling Technique (SMOTE) is the most commonly used method to address this challenge. SMOTE reduces the overfitting that arises from randomly replicating minority class samples by synthesizing new samples instead of duplicating existing ones, and it provides an effective way to handle imbalanced data. However, SMOTE still has limitations, such as sensitivity to noise, intra-class imbalance, and undifferentiated selection of base samples during sample synthesis. In this paper, the SMOTE algorithm is improved to address these shortcomings, and the improved algorithm is evaluated experimentally and validated in an application. The main research work is as follows: (1) An improved oversampling method is proposed to address the shortcomings of SMOTE and balance the data more effectively. First, before balancing the data, the KNN algorithm is applied to filter out noisy samples in the dataset, reducing their influence on sampling. Second, in the oversampling stage, the minority class samples are adaptively divided into 
clusters, and the local density of the minority class samples is calculated, which addresses the problem caused by intra-class imbalance. Since samples with higher density carry more useful information than others, the quality of the samples in each cluster is assessed, and higher-density minority samples are selected probabilistically with a roulette wheel selection operator to synthesize new minority class samples and balance the data distribution. These steps reduce the imbalance of the data. To verify the effectiveness of the improved algorithm, experiments are carried out on several datasets from UCI with different degrees of imbalance and different sizes, and the results confirm the effectiveness of the improved algorithm proposed in this paper. (2) A truck fault detection model is constructed based on the improved SMOTE algorithm. Following an analysis of fault detection methods, a real truck fault prediction dataset is selected for building the model. On the sampling side, random oversampling, SMOTE, Borderline-SMOTE, ADASYN, and the improved algorithm of this paper are selected; on the classifier side, logistic regression, KNN, and decision tree are selected. Combining the sampling algorithms with the classifiers yields various candidate truck fault detection models. Each model is applied to the truck fault prediction dataset, its outputs are analyzed, and the models are compared using the corresponding evaluation metrics to select the one with the best performance. The experimental results show that the truck fault detection model combining the improved SMOTE algorithm of this paper with the KNN classifier achieves the best performance on the 
truck fault dataset and effectively improves the accuracy of fault prediction. The effectiveness of the improved SMOTE algorithm proposed in this paper is thus also verified from the perspective of practical application. |
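
The oversampling steps described in (1) can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the paper's implementation: the noise rule (drop a minority sample when all of its k nearest neighbours belong to the majority class), the density definition (inverse of the mean distance to the k nearest minority samples), and every function name are hypothetical choices made for this example. The adaptive clustering step is omitted, so the roulette wheel here runs over the whole minority class rather than within each cluster.

```python
import math
import random


def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def filter_noise(X, y, minority_label, k=3):
    """Assumed noise rule: drop a minority sample if all k neighbours are majority."""
    keep = []
    for i, label in enumerate(y):
        if label == minority_label:
            neighbours = k_nearest(i, X, k)
            if all(y[j] != minority_label for j in neighbours):
                continue  # treated as noise
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def local_density(minority, k=3):
    """Assumed density: inverse mean distance to the k nearest minority samples.

    Requires at least two minority samples.
    """
    densities = []
    for i in range(len(minority)):
        neighbours = k_nearest(i, minority, k)
        mean_d = sum(math.dist(minority[i], minority[j]) for j in neighbours) / len(neighbours)
        densities.append(1.0 / (mean_d + 1e-12))
    return densities


def roulette_pick(weights, rng):
    """Roulette wheel selection: index i is drawn with probability weights[i] / sum(weights)."""
    threshold = rng.random() * sum(weights)
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if threshold < running:
            return i
    return len(weights) - 1  # guard against floating-point round-off


def oversample(minority, n_new, k=3, seed=0):
    """Synthesize n_new points by SMOTE-style interpolation from density-weighted bases."""
    rng = random.Random(seed)
    weights = local_density(minority, k)
    synthetic = []
    for _ in range(n_new):
        base = roulette_pick(weights, rng)  # denser samples are picked more often
        neighbour = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(
            b + gap * (n - b) for b, n in zip(minority[base], minority[neighbour])
        ))
    return synthetic
```

A typical use would call `filter_noise` on the full labelled data first, collect the remaining minority samples, and then call `oversample` with `n_new` set to the number of samples needed to reach the desired class ratio.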