
Research On Cost-sensitive Algorithms Based On Multi-objective Optimization

Posted on: 2019-12-28    Degree: Master    Type: Thesis
Country: China    Candidate: Y W Shi    Full Text: PDF
GTID: 2438330572955967    Subject: Software engineering
Abstract/Summary:
With the application and development of computers in many fields, the volume of data generated as users interact with systems and as data are collected has grown exponentially, and data mining technology has emerged in response. Cost-sensitive learning is one of the ten most challenging problems in data mining; the purpose of its classification decisions is to minimise the test cost, time cost, misclassification cost, and so on. The essence of cost-sensitive learning is to introduce cost criteria while keeping the classification ability of the decision system unchanged, and to reduce the total classification cost as far as possible, so as to help people make decisions efficiently.

Many decision tree algorithms have been proposed for cost-sensitive decision tree problems. The classification rules of the C4.5 algorithm are easy to understand and highly accurate, but the data set must be scanned and sorted several times during construction, which lowers the algorithm's efficiency. The CART algorithm is flexible, tolerates a degree of misclassification, and is robust to problems such as missing values and large numbers of variables, but the resulting decision tree tends to have many branches and a large scale. Previous decision tree classification studies either introduce a single cost criterion or split purely on information entropy, and the resulting models tend to overfit. When built on imbalanced data sets, cost-sensitive decision trees consider only the total classification cost and ignore the differences among samples at the same node; different impurity measures also lead to quite different results.

Against this background, this thesis proposes a multi-objective cost-sensitive decision tree that combines test cost with information gain, together with a new impurity (ambiguity) measure for imbalanced data. On the one hand, the approach effectively reduces the total classification cost and improves classification performance; on the other hand, the new ambiguity measure addresses the cost differences among samples at the same node and enables effective classification of imbalanced data. The research in this thesis is divided into the following two parts:

1. Traditional cost-sensitive decision tree algorithms often consider only the misclassification cost, which makes them unsuitable for decision systems that involve test costs. This thesis therefore considers test cost and misclassification cost together and proposes two new cost-sensitive ID3 algorithms. The main idea of both is to replace the traditional, single information-gain splitting criterion with a new attribute-splitting criterion: the cost-sensitive decision tree algorithm with a pure test-cost splitting criterion (CT-ID3) and the cost-sensitive decision tree algorithm with a combined test-cost and information-gain splitting criterion (TIG-ID3), both of which are compared with the traditional ID3 algorithm.
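To make the combined splitting idea concrete, the following is a minimal Python sketch of an attribute-splitting score in the spirit of TIG-ID3. The abstract does not give the exact functional form of the criterion, so the `tig_score` function below (information gain divided by a penalised test cost, with a hypothetical weight `w`) is an illustrative assumption, not the thesis's actual formula.

```python
# Illustrative sketch of a cost-sensitive attribute-splitting criterion.
# The exact form used by CT-ID3 / TIG-ID3 is not given in the abstract;
# the score below (information gain divided by a penalised test cost) is
# only one common way of combining the two objectives.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on one attribute."""
    base = entropy(labels)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return base - remainder

def tig_score(rows, labels, attr_index, test_costs, w=1.0):
    """Hypothetical test-cost/information-gain splitting score:
    larger gain and smaller test cost both raise the score."""
    gain = information_gain(rows, labels, attr_index)
    return gain / (test_costs[attr_index] + 1.0) ** w

def best_attribute(rows, labels, test_costs, w=1.0):
    """Pick the attribute with the highest combined score."""
    return max(range(len(test_costs)),
               key=lambda a: tig_score(rows, labels, a, test_costs, w))
```

Setting `w = 0` reduces the score to plain information gain (ID3), while larger `w` values penalise expensive attributes more heavily; the weight shown here is purely illustrative.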
2. A random forest is constructed from the multi-objective cost-sensitive decision trees based on test cost and information gain proposed above, and a new impurity measure is introduced during construction; it considers not only the total cost of the decision tree but also the cost differences among different samples at the same node. The random forest algorithm then draws K bootstrap samples from the data set and builds K base classifiers. Based on the proposed ambiguity measure, each decision tree is constructed with the Classification and Regression Tree (CART) algorithm to form a forest of decision trees, and the random forest finally makes classification decisions through a voting mechanism (see the sketch after the results below).

The experiments use real data sets from the UCI repository, with three different classification indicators reported for each data set. Extensive comparisons with existing algorithms show that: 1) the algorithm based on test cost and information gain proposed in this thesis is significantly more efficient than algorithms based on a single test-cost criterion; 2) the new ambiguity measure has clear advantages during random forest construction; 3) the algorithm can effectively handle imbalanced data and the problem of different samples at the same node.
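The sketch below illustrates only the bagging-and-voting structure described in part 2. The thesis's own ambiguity measure is not defined in the abstract, so scikit-learn's CART implementation with per-class weights (`class_weight`, a stand-in assumption here) serves as a rough cost-sensitive proxy for the base learner; the K bootstrap samples and the majority-vote combination follow the description above.

```python
# Sketch of the bagging-and-voting pipeline from part 2 (assumptions noted
# above): scikit-learn CART trees with class weights stand in for the
# thesis's cost-sensitive base learner and ambiguity measure.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def cost_sensitive_forest(X, y, K=25, class_weight=None, seed=0):
    """Train K CART base classifiers on bootstrap samples of (X, y).
    X and y are expected to be NumPy arrays."""
    rng = np.random.default_rng(seed)
    trees = []
    n = len(y)
    for _ in range(K):
        idx = rng.integers(0, n, size=n)            # bootstrap sample
        tree = DecisionTreeClassifier(criterion="gini",
                                      class_weight=class_weight)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def vote(trees, X):
    """Combine the base classifiers by simple majority voting."""
    preds = np.array([t.predict(X) for t in trees])  # shape (K, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in preds.T])

# Example: weight the minority class more heavily on an imbalanced data set.
# forest = cost_sensitive_forest(X_train, y_train, K=25,
#                                class_weight={0: 1.0, 1: 5.0})
# y_pred = vote(forest, X_test)
```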
Keywords/Search Tags: Information gain, Test cost, Decision tree, Cost-sensitive learning, Impurity measure