
Research On Decision Tree Classification Method Of Imbalanced Data Based On Reinforcement Learning

Posted on: 2019-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: Z Niu
Full Text: PDF
GTID: 2348330569479967
Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years, with the popularization of the Internet and advances in informatization, industries are generating more and more data. Rapid classification and recognition is key to improving the speed of intelligent information processing and accelerating the development of related industries. Although the total amount of data keeps growing, some classes of data still account for only a very small share; data sets of this kind are imbalanced, and it is usually these minority-class samples that are the focus of research. At present, existing classifiers perform poorly at identifying minority-class samples when the data is imbalanced. Based on an analysis of imbalanced data distributions, this thesis presents an improved redundancy-removed under-sampling algorithm for preprocessing imbalanced data sets, and, by studying decision tree classification and reinforcement learning, proposes a new ensemble forest classification model. The main work of this thesis is as follows:

Firstly, an improved under-sampling method based on clustering fusion and redundancy removal is proposed, applied to the preprocessing of imbalanced data before classification, and compared with existing under-sampling methods. By analyzing the defects of existing under-sampling algorithms in light of the distribution of imbalanced data sets, this thesis introduces the concept of a similarity redundancy coefficient and under-samples the data set according to this coefficient. Results show that the method significantly improves the minority-class true positive rate and the G-mean value while leaving the overall classification accuracy essentially unchanged.

Secondly, a decision tree optimization model based on a reinforcement learning cumulative-return attribute selection method is proposed. By analyzing the principles of reinforcement learning and combining them with the growth process of a decision tree, a cumulative-return attribute selection method is derived. The cumulative-return factor is integrated into attribute selection at each split node of the decision tree, strengthening the tree's classification of minority-class samples. Experiments comparing the cumulative-return method against a cost-sensitive decision tree and the original decision tree classification model demonstrate the effectiveness of the method.

Thirdly, building on the random forest algorithm, an improved ensemble forest algorithm based on same-distribution random sampling is presented. By analyzing the principles of the random forest algorithm together with the distribution characteristics of imbalanced data sets, this thesis proposes a new same-distribution sampling method. The sample subsets obtained in this way not only preserve the distribution of the original data set but also reduce its imbalance rate. The ensemble forest algorithm is formed by combining the cumulative-return attribute selection method with the same-distribution sampling method. Finally, the effectiveness of the proposed ensemble forest algorithm is verified by experiments.
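The abstract does not give the formula for the similarity redundancy coefficient, so the under-sampling step can only be sketched under an assumption: here it is treated as a nearest-neighbour distance test, where a majority-class sample is considered redundant if it lies within a threshold distance of a sample already kept. The function name, the threshold parameter, and the synthetic data are all illustrative, not taken from the thesis.

```python
import math
import random

def undersample_redundant(majority, threshold):
    """Keep a majority-class sample only if no already-kept sample lies
    within `threshold` of it; closer samples are treated as redundant.
    This is a stand-in for the thesis's similarity redundancy coefficient,
    whose exact definition is not given in the abstract."""
    kept = []
    for x in majority:
        if all(math.dist(x, k) >= threshold for k in kept):
            kept.append(x)
    return kept

random.seed(0)
# Synthetic 2-D majority class: 200 points in one tight Gaussian cluster.
majority = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
reduced = undersample_redundant(majority, threshold=0.3)
print(len(majority), "->", len(reduced))
```

Because dense regions are thinned the most, this kind of pruning tends to preserve the shape of the majority-class distribution while shrinking its size, which matches the abstract's claim that overall accuracy is roughly unchanged.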
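The cumulative-return attribute selection can likewise be sketched under an assumption: the split score is ordinary information gain plus a reward that accrues whenever a child node is purer in the minority class than its parent, so splits that isolate minority samples are preferred. The reward value and the scoring formula are hypothetical; the thesis's actual cumulative-return definition is not stated in the abstract.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def cumulative_return_score(parent, splits, minority=1, reward=0.5):
    """Information gain plus an assumed cumulative-return bonus: each child
    whose minority-class fraction exceeds the parent's earns `reward`."""
    n = len(parent)
    gain = entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)
    parent_frac = parent.count(minority) / n
    bonus = sum(reward for s in splits
                if s and s.count(minority) / len(s) > parent_frac)
    return gain + bonus

parent = [0] * 8 + [1] * 2               # imbalanced node: 8 majority, 2 minority
split_a = [[0] * 8, [1] * 2]             # cleanly isolates the minority class
split_b = [[0] * 4 + [1], [0] * 4 + [1]] # children as mixed as the parent
print(cumulative_return_score(parent, split_a),
      cumulative_return_score(parent, split_b))
```

Under this scoring, split_a outranks split_b both on gain and on the bonus, illustrating how folding a minority-class reward into node splitting biases the tree toward the minority class, as the abstract describes.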
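One plausible reading of the same-distribution sampling step, sketched here purely as an assumption: each tree's bootstrap draws with replacement from each class separately, so every subset follows each class's own distribution, while capping the majority:minority ratio to reduce the imbalance rate. The `ratio` knob and the placeholder data are illustrative, not taken from the thesis.

```python
import random

def same_distribution_bag(majority, minority, ratio=2.0, rng=random):
    """Build one bootstrap whose majority:minority ratio is capped at
    `ratio`. Drawing with replacement from each class keeps that class's
    own distribution; `ratio` is an assumed parameter."""
    n_min = len(minority)
    bag_maj = [rng.choice(majority) for _ in range(int(ratio * n_min))]
    bag_min = [rng.choice(minority) for _ in range(n_min)]
    return bag_maj + bag_min

random.seed(1)
majority = list(range(1000))         # placeholder majority-class samples
minority = list(range(1000, 1050))   # 50 minority samples, imbalance 20:1
bag = same_distribution_bag(majority, minority)
print(len(bag))  # 2*50 + 50 = 150, imbalance reduced from 20:1 to 2:1
```

Training one cumulative-return decision tree per such bag and voting their predictions would then yield the ensemble forest the abstract describes.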
Keywords/Search Tags: imbalanced data set, clustering, redundancy-removed under-sampling, cumulative return, ensemble forest