
Classification Of Non-equilibrium High-Dimensional Small Sample Data Based On RF And LSSVM Models

Posted on: 2021-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Wang
Full Text: PDF
GTID: 2428330602477592
Subject: Statistics
Abstract/Summary:
In the era of information explosion, driven by the rapid development of global science, technology and the economy, data is present in every corner of the world and its structure has become increasingly diverse. Classification is among the most common data-analysis tasks, and it comes with two difficulties: class imbalance and high dimensionality. Traditional data mining methods, however, focus mainly on low-dimensional, balanced data. Traditional classification methods include linear discriminant analysis, the logistic discriminant model, the support vector machine, the K-nearest neighbor algorithm, decision trees, random forests and neural networks. Yet high-dimensional imbalanced data are now abundant in many fields, while traditional methods pay little attention to classifying them.

When imbalanced data are classified, the class sizes are so skewed that a classifier's high overall accuracy comes largely from correctly classifying the majority-class samples. The purpose of classification, however, is often to classify the minority-class samples accurately, so applying common classification algorithms directly to imbalanced data sets is not ideal. High dimensionality is the other difficulty, and a long-standing one in pattern recognition research: a minimal feature subset that is necessary, representative and sufficient must be identified from the feature set in order to reduce the dimension of the feature space. The processing and classification of high-dimensional imbalanced data therefore bears on many fields and is particularly important in data mining.

In this thesis, a new algorithm is proposed to overcome the shortcomings of the basic algorithms for processing high-dimensional imbalanced data, namely the random forest algorithm and oversampling. Firstly, the particle swarm optimization (PSO) algorithm is combined with the Gini and out-of-bag (OOB) estimation feature selection criteria of the random forest model to propose the MOG algorithm, which is used to reduce the dimension of high-dimensional data. Secondly, the SMOTE algorithm is improved by a machine learning method under the criterion of the sum of squares of dynamic deviations (PDSD), yielding the PDSSD-TSMOTE algorithm, which is used to balance the class distribution of the data. Finally, the standard particle swarm optimization algorithm is used to improve the least squares support vector machine (LSSVM) classifier, and the processed data are classified to verify the effectiveness of the data processing algorithm proposed in this thesis.

The experimental data sets are four real data sets from the UCI Machine Learning Repository. The experimental results show that the MOG-PDSSD-TSMOTE algorithm is superior when the PSO-LSSVM classifier is used for classification. On the Arrhythmia data set, the three reported evaluation metrics [metric names illegible in the source] increased by 15%, 11.7% and 8.2% respectively; on the Regular Colonoscopy data set, the same metrics increased by 17.2%, 12% and 11.4%; on the Voice back data set, they increased by 21.1%, 16.6% and 13.5%.
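For orientation, the oversampling step that the thesis builds on can be sketched as follows. This is plain SMOTE interpolation, not the PDSSD-TSMOTE variant proposed in the thesis, whose details are not given in this abstract. The function name `smote_sketch` and its parameters are illustrative assumptions, not names from the thesis.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by plain SMOTE interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    Hypothetical helper for illustration; not the thesis's PDSSD-TSMOTE.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to it.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Place the new sample at a random point on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote_sketch(minority_samples, n_new=3, k=2):
        print(point)
```

Each synthetic sample lies on a line segment between two existing minority samples, so the minority class grows without duplicating points exactly. In a pipeline like the one described, such a step would run after feature selection and before the classifier is trained.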
Keywords/Search Tags: high-dimensional disequilibrium data, random forest model, particle swarm optimization algorithm, MOG algorithm, PDSSD-SMOTE algorithm, least squares support vector machine algorithm model