
Research On Classification Algorithm Of Meteorological Imbalanced Data

Posted on: 2021-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: T T Cao
Full Text: PDF
GTID: 2370330605460934
Subject: Computer application technology

Abstract/Summary:
In recent decades, natural disasters such as extreme weather and dust storms have occurred frequently in Gansu, Xinjiang, Ningxia, and other parts of western China, seriously affecting the local ecological environment, social economy, and people's lives. In-depth analysis of the associated meteorological data, such as the classification of sandstorm records, can support sound decision-making by government agencies and agricultural disaster early-warning systems. Classification is an important part of data mining and machine learning. Traditional classification algorithms generally assume that the sample distribution is balanced, but imbalanced data are common in real life, which poses certain challenges for research. Because sandstorm meteorological data suffer from imbalanced class distributions, this dissertation is essentially a study of the imbalanced data classification problem. Imbalanced data classification has important applications in many fields, such as credit card fraud detection, medical health prediction, and anomaly detection. In imbalanced classification, the cost of misclassifying minority-class samples is relatively high; in weather forecasting, for example, people care more about the prediction accuracy for extreme weather such as sandstorms, rainstorms, and frosts. Traditional classification methods aim to maximize overall accuracy, which greatly limits their application to such practical problems. The main purpose of this dissertation is therefore to train models with high accuracy and good robustness on imbalanced public data sets and meteorological data sets, so that sandstorm classification can be performed more effectively.

This dissertation reviews the background and significance of imbalanced data classification and the current state of research at home and abroad, and analyzes the related theory. At the data level, the most common approach is to balance the data with sampling techniques (oversampling, undersampling, SMOTE, and their improved variants); at the algorithm level, traditional classifiers are adapted accordingly, mainly through cost-sensitive learning, ensemble algorithms, and threshold moving. Finally, evaluation metrics suited to imbalanced data classification, such as F-measure, Kappa, AUC, and G-mean, are studied.

To address the fuzzy class boundaries that the SMOTE oversampling algorithm tends to produce, this dissertation first proposes, at the data level, the BSL-FSRF algorithm based on hybrid sampling and ReliefF feature selection. The algorithm introduces BSL sampling, which divides the minority-class samples into safe samples, noise samples, and borderline samples; SMOTE interpolation is performed only on the borderline samples, and Tomek links are then used for data cleaning, so that the data set becomes roughly balanced while the number of noise samples is reduced. Next, the "hypothesis margin" idea is introduced to weight each feature dimension: an appropriate threshold is set, features weakly relevant to the class label are removed, and the dimensionality of the data is reduced. Finally, a random forest is used as the classifier, and an improved grid search (GridSearch) procedure optimizes its parameters while saving running time. The BSL-FSRF algorithm was verified experimentally on public data sets, and the results show a significant improvement in the classification accuracy of minority-class samples and in the overall performance of the classifier.

Secondly, at the algorithm level, a cost-sensitive Stacking ensemble algorithm, KPCA-Stacking, is proposed by combining cost-sensitive learning with kernel principal component analysis (KPCA). Cost-sensitive learning is an important strategy for imbalanced data classification, and the non-linearity of the data characteristics also makes classification harder. The algorithm first oversamples the original data set with adaptive synthetic sampling (ADASYN) and applies KPCA dimensionality reduction; it then converts KNN, LDA, SVM, and RF into cost-sensitive algorithms according to the Bayesian risk minimization principle and uses them as the primary learners of a Stacking ensemble framework, with logistic regression as the meta-learner. The two-layer Stacking architecture integrates the base models, and the KPCA step effectively extracts the non-linear features of the data. Experiments show that the cost-sensitive KPCA-Stacking algorithm achieves better classification results. Finally, on imbalanced sandstorm data from parts of Gansu, a sandstorm classification model was constructed with the cost-sensitive KPCA-Stacking algorithm, and the effectiveness of the above algorithms for sandstorm classification was verified experimentally.
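The imbalance-aware evaluation metrics named above can be computed directly from a binary confusion matrix. The following is an illustrative calculation on a toy label vector (not code from the dissertation), with the minority class taken as the positive class:

```python
import numpy as np
from sklearn.metrics import (cohen_kappa_score, confusion_matrix, f1_score,
                             roc_auc_score)

# Toy ground truth and predictions; class 1 is the minority (positive) class.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)             # sensitivity on the minority class
specificity = tn / (tn + fp)        # accuracy on the majority class
g_mean = np.sqrt(recall * specificity)
f_measure = f1_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
# AUC is normally computed from scores; with hard 0/1 predictions it
# reduces to the balanced accuracy (recall + specificity) / 2.
auc = roc_auc_score(y_true, y_pred)
```

Note how overall accuracy here is 0.8 even though a third of the minority samples are missed; G-mean and F-measure expose that gap, which is why they are preferred for imbalanced data.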
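The BSL sampling step can be sketched roughly as follows. This is a hypothetical NumPy illustration of the idea, not the dissertation's implementation: each minority sample is tagged as safe, noise, or borderline from the share of majority-class points among its k nearest neighbours, and SMOTE-style interpolation is applied only to the borderline samples (the Tomek-link cleaning pass is omitted here):

```python
import numpy as np

def bsl_oversample(X, y, minority=1, k=5, seed=0):
    """Sketch of BSL-style borderline oversampling (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_needed = (y != minority).sum() - len(X_min)   # samples to synthesise

    # Distances from each minority sample to every sample in X.
    d = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(d, axis=1)[:, 1:k + 1]       # k nearest, self excluded
    maj_ratio = (y[order] != minority).mean(axis=1)  # majority share per sample

    # ratio < 0.5 -> safe, ratio == 1 -> noise, otherwise borderline.
    border = X_min[(maj_ratio >= 0.5) & (maj_ratio < 1.0)]
    if len(border) == 0 or n_needed <= 0:
        return X, y

    # SMOTE interpolation between borderline samples and random minority mates.
    a = border[rng.integers(0, len(border), n_needed)]
    b = X_min[rng.integers(0, len(X_min), n_needed)]
    X_new = a + rng.random((n_needed, 1)) * (b - a)
    return (np.vstack([X, X_new]),
            np.concatenate([y, np.full(n_needed, minority)]))
```

In practice the closely related BorderlineSMOTE and TomekLinks classes from the imbalanced-learn library cover both halves of this hybrid scheme.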
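The KPCA-Stacking architecture can be approximated with standard scikit-learn components. In this sketch, cost sensitivity is emulated with `class_weight="balanced"` rather than the dissertation's Bayesian-risk-derived costs, and the ADASYN oversampling step is omitted for brevity; KNN has no class-weight mechanism in scikit-learn and is kept plain:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the sandstorm set (9:1 ratio).
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

model = make_pipeline(
    KernelPCA(n_components=10, kernel="rbf"),   # non-linear feature extraction
    StackingClassifier(
        estimators=[                            # primary learners
            ("knn", KNeighborsClassifier()),
            ("lda", LinearDiscriminantAnalysis()),
            ("svm", SVC(class_weight="balanced", probability=True)),
            ("rf", RandomForestClassifier(class_weight="balanced",
                                          random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    ),
)
model.fit(X, y)
```

StackingClassifier trains the meta-learner on cross-validated predictions of the base models, which matches the two-layer architecture described above.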
Keywords/Search Tags: Imbalanced Data, Hybrid Sampling, Cost Sensitive, Stacking Ensemble Learning, Sandstorm Classification