
Research On Cost-sensitive Algorithms Based On Multi-objective Optimization

Posted on: 2019-12-28    Degree: Master    Type: Thesis
Country: China    Candidate: Y W Shi    Full Text: PDF
GTID: 2438330572955967    Subject: Software engineering
Abstract/Summary:
With the application and development of computers in many fields, the volume of data generated as users interact with systems and as data are collected has grown exponentially, and data mining technology has emerged in response. Cost-sensitive learning is one of the ten most challenging problems in data mining; the purpose of its classification decisions is to minimise the test cost, time cost, misclassification cost, and so on. The essence of cost-sensitive learning is to introduce cost criteria while keeping the classification ability of the decision system unchanged, and to reduce the total classification cost as far as possible, so as to help people make decisions efficiently.

Many decision tree algorithms have been proposed for cost-sensitive decision tree problems. The classification rules of the C4.5 algorithm are easy to understand and highly accurate, but the data set must be scanned and sorted several times during construction, which lowers the algorithm's efficiency. The CART algorithm is flexible, tolerates a degree of misclassification, and is robust to problems such as missing values and large numbers of variables, but the resulting decision tree tends to have many branches and a large scale. Previous decision tree classification studies either introduce a single cost criterion or split purely on information entropy, and the resulting models tend to overfit. When built on imbalanced data sets, cost-sensitive decision trees consider only the total classification cost and ignore the differences among samples at the same node; different impurity measures also lead to quite different results.

Against this background, this thesis proposes a multi-objective cost-sensitive decision tree that combines test cost with information gain, together with a new impurity (ambiguity) measure for imbalanced data. On the one hand, the approach effectively reduces the total classification cost and improves classification performance; on the other hand, the new ambiguity measure addresses the cost differences among samples at the same node and enables effective classification of imbalanced data. The research in this thesis is divided into the following two parts:

1. Traditional cost-sensitive decision tree algorithms often consider only the misclassification cost, which makes them unsuitable for decision systems that involve test costs. This thesis therefore considers test cost and misclassification cost together and proposes two new cost-sensitive ID3 algorithms. The main idea of both is to replace the traditional, single information-gain splitting criterion with a new attribute-splitting criterion: the cost-sensitive decision tree algorithm with a pure test-cost splitting criterion (CT-ID3) and the cost-sensitive decision tree algorithm with a combined test-cost and information-gain splitting criterion (TIG-ID3), both of which are compared with the traditional ID3 algorithm.
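To make the combined splitting idea concrete, the following is a minimal Python sketch of an attribute-splitting score in the spirit of TIG-ID3. The abstract does not give the exact functional form of the criterion, so the `tig_score` function below (information gain divided by a penalised test cost, with a hypothetical weight `w`) is an illustrative assumption, not the thesis's actual formula.

```python
# Illustrative sketch of a cost-sensitive attribute-splitting criterion.
# The exact form used by CT-ID3 / TIG-ID3 is not given in the abstract;
# the score below (information gain divided by a penalised test cost) is
# only one common way of combining the two objectives.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on one attribute."""
    base = entropy(labels)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return base - remainder

def tig_score(rows, labels, attr_index, test_costs, w=1.0):
    """Hypothetical test-cost/information-gain splitting score:
    larger gain and smaller test cost both raise the score."""
    gain = information_gain(rows, labels, attr_index)
    return gain / (test_costs[attr_index] + 1.0) ** w

def best_attribute(rows, labels, test_costs, w=1.0):
    """Pick the attribute with the highest combined score."""
    return max(range(len(test_costs)),
               key=lambda a: tig_score(rows, labels, a, test_costs, w))
```

Setting `w = 0` reduces the score to plain information gain (ID3), while larger `w` values penalise expensive attributes more heavily; the weight shown here is purely illustrative.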
2. A random forest is constructed from the multi-objective cost-sensitive decision trees based on test cost and information gain proposed above, and a new impurity measure is introduced during construction; it considers not only the total cost of the decision tree but also the cost differences among different samples at the same node. The random forest algorithm then draws K bootstrap samples from the data set and builds K base classifiers. Based on the proposed ambiguity measure, each decision tree is constructed with the Classification and Regression Tree (CART) algorithm to form a forest of decision trees, and the random forest finally makes classification decisions through a voting mechanism (see the sketch after the results below).

The experiments use real data sets from the UCI repository, with three different classification indicators reported for each data set. Extensive comparisons with existing algorithms show that: 1) the algorithm based on test cost and information gain proposed in this thesis is significantly more efficient than algorithms based on a single test-cost criterion; 2) the new ambiguity measure has clear advantages during random forest construction; 3) the algorithm can effectively handle imbalanced data and the problem of different samples at the same node.
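The sketch below illustrates only the bagging-and-voting structure described in part 2. The thesis's own ambiguity measure is not defined in the abstract, so scikit-learn's CART implementation with per-class weights (`class_weight`, a stand-in assumption here) serves as a rough cost-sensitive proxy for the base learner; the K bootstrap samples and the majority-vote combination follow the description above.

```python
# Sketch of the bagging-and-voting pipeline from part 2 (assumptions noted
# above): scikit-learn CART trees with class weights stand in for the
# thesis's cost-sensitive base learner and ambiguity measure.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def cost_sensitive_forest(X, y, K=25, class_weight=None, seed=0):
    """Train K CART base classifiers on bootstrap samples of (X, y).
    X and y are expected to be NumPy arrays."""
    rng = np.random.default_rng(seed)
    trees = []
    n = len(y)
    for _ in range(K):
        idx = rng.integers(0, n, size=n)            # bootstrap sample
        tree = DecisionTreeClassifier(criterion="gini",
                                      class_weight=class_weight)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def vote(trees, X):
    """Combine the base classifiers by simple majority voting."""
    preds = np.array([t.predict(X) for t in trees])  # shape (K, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in preds.T])

# Example: weight the minority class more heavily on an imbalanced data set.
# forest = cost_sensitive_forest(X_train, y_train, K=25,
#                                class_weight={0: 1.0, 1: 5.0})
# y_pred = vote(forest, X_test)
```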
Keywords/Search Tags: Information gain, Test cost, Decision tree, Cost-sensitive learning, Impurity measure