Analysis and Optimization of the C5.0 Decision Tree Algorithm Based on Misclassification Cost

Posted on: 2015-03-24 | Degree: Master | Type: Thesis
Country: China | Candidate: K Zhang
GTID: 2308330461483936 | Subject: Computer application technology
Abstract/Summary:
In classification applications of data mining, the decision tree is one of the most widely used algorithms. It is simple and efficient and achieves high classification accuracy. However, when the classification model is built, all classification errors are treated equally: the modeling process does not distinguish between errors that carry different costs, so the total misclassification cost of the resulting model is high. To address this problem, this thesis introduces the concept of a cost matrix. By analyzing the misclassification cost values of the different error types and the corresponding cost matrix, and by applying data-mining classification to hospital patients, the C5.0 algorithm is optimized during modeling, yielding a patient classification model with a lower misclassification cost. Experiments verify that the cost matrix effectively reduces the misclassification cost of the predictive classification model.
This thesis surveys the techniques commonly used in data mining and analyzes decision tree classification algorithms in depth. It then studies how to optimize the C5.0 decision tree algorithm with a cost matrix and applies the result to hospital patient classification. For the data mining model of this practical patient classification application, the cost matrix, the degree of pruning, and the Boosting algorithm are analyzed. In the cost matrix optimization, three error cost values are introduced: COST(high) for high-cost misjudgments, COST(middle) for moderate-cost misjudgments, and COST(low) for low-cost misjudgments. After analyzing the conditions that determine the misjudgment cost values, a comparative analysis gives the final values COST(high) = 3, COST(middle) = 2, COST(low) = 1.
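The role of the cost matrix can be illustrated with a minimal Python sketch. It uses the three cost values from the abstract, but the class names ("high", "medium", "low" patient value) and the assignment of each cost to a particular error type are assumptions made for illustration; the abstract does not specify them. The sketch is not the C5.0 implementation. It only shows how a cost matrix turns a list of errors into a total cost, and how it can shift a prediction away from the most probable class.

```python
# Hypothetical patient value classes; the abstract does not name them.
CLASSES = ["high", "medium", "low"]

# Cost values reported in the abstract.
COST_HIGH, COST_MIDDLE, COST_LOW = 3, 2, 1

# COST_MATRIX[actual][predicted]. Which error type receives which cost
# is an assumption for illustration, not taken from the thesis.
COST_MATRIX = {
    "high":   {"high": 0,        "medium": COST_MIDDLE, "low": COST_HIGH},
    "medium": {"high": COST_LOW, "medium": 0,           "low": COST_MIDDLE},
    "low":    {"high": COST_LOW, "medium": COST_LOW,    "low": 0},
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return sum(COST_MATRIX[a][p] for a, p in zip(actual, predicted))


def min_cost_class(probs):
    """Return the class with the lowest expected cost.

    probs maps each class to its estimated probability for one sample.
    """
    def expected_cost(pred):
        return sum(probs[a] * COST_MATRIX[a][pred] for a in CLASSES)

    return min(CLASSES, key=expected_cost)


# One costly error (high predicted as low) and one cheap error.
print(total_cost(["high", "high", "low"], ["low", "high", "medium"]))

# "low" is the most probable class, but predicting "high" is cheaper
# in expectation under this cost matrix.
print(min_cost_class({"high": 0.2, "medium": 0.3, "low": 0.5}))
```

Under these assumed costs, an error rate alone would treat both mistakes in the first call identically, while the cost matrix weights the high-value patient predicted as low-value three times as heavily.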
In the pruning optimization, two reference values are used to select the degree of pruning: the complexity and the classification accuracy of the tree model. The optimal degree of pruning is determined by experimentally comparing these two values. In the Boosting optimization, the number of iterations and the over-fitting problem are analyzed. A comparison on the test samples reveals over-fitting, so Boosting iterations are not used in this modeling. On this basis, inpatient data from the hospital are sampled, pre-processed, and extracted as modeling data; the C5.0 decision tree algorithm is used to build a classification model for hospitalized patients, and test data are used to evaluate the model. The model is also applied in the inpatient classification module of the hospital's customer relationship management (CRM) system, implementing the system's data management module, which can classify newly admitted inpatients by value.
The innovation of this thesis lies in its analysis of the C5.0 decision tree algorithm. The predictive classification takes the cost of misjudgments into account, and the conditions for assigning the misjudgment cost values are given. A cost matrix is established to guide modeling, so that the misclassification cost is minimized while the model's overall prediction error rate changes little. The Boosting analysis finds that Boosting iterations cause the model to over-fit the modeling data. The resulting patient classification model has low risk and good stability, and it enables the hospital's CRM system to classify new patients by customer value. However, the model's classification error rates on the modeling data and the test data are 8.29% and 8.17% respectively, so classification accuracy can still be improved.
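The two comparisons described above, complexity against accuracy for pruning and modeling error against test error for over-fitting, can be sketched in Python as follows. The selection rule (smallest tree within a tolerance of the best accuracy), the `tolerance` parameter, and the candidate figures are assumptions for illustration and are not results or procedures taken from the thesis; only the two error rates in the last call come from the abstract.

```python
def overfit_gap(train_error, test_error):
    """Test error minus modeling (training) error.

    A clearly positive gap suggests the model fits the modeling data
    better than unseen data, i.e. over-fitting.
    """
    return test_error - train_error


def pick_pruning(candidates, tolerance=0.005):
    """Choose a pruning level from (level, leaf_count, test_accuracy) tuples.

    Assumed rule: among trees whose accuracy is within `tolerance` of the
    best accuracy, return the level of the one with the fewest leaves.
    """
    best_accuracy = max(acc for _, _, acc in candidates)
    close_enough = [c for c in candidates if best_accuracy - c[2] <= tolerance]
    return min(close_enough, key=lambda c: c[1])[0]


# Placeholder figures for illustration only, not results from the thesis.
candidates = [
    (25, 120, 0.915),  # light pruning: large tree
    (50, 60, 0.913),   # moderate pruning: half the leaves, similar accuracy
    (75, 20, 0.880),   # heavy pruning: small tree, accuracy drops
]
print(pick_pruning(candidates))

# Error rates reported in the abstract for the final model:
# 8.29% on the modeling data, 8.17% on the test data.
print(overfit_gap(0.0829, 0.0817) > 0)
```

For the error rates reported in the abstract, the test error is slightly below the modeling error, so this simple check does not flag over-fitting for the final model built without Boosting.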
Keywords/Search Tags: decision tree, C5.0 algorithm, misclassification cost, cost matrix