
Research On Improvement Of Pruning Strategy Based On C4.5 Algorithm

Posted on: 2017-04-07    Degree: Master    Type: Thesis
Country: China    Candidate: L Qiu    Full Text: PDF
GTID: 2308330488485674    Subject: Computer application technology
Abstract/Summary:
The classification algorithm is an important technology of data mining; the main evaluation indices of a classification model are its calculation speed, robustness, interpretability, scalability, and accuracy. The decision tree classification algorithm is an effective method for classifying a sample set, and the classification rules reflected by a decision tree are intuitive and easy to understand. Decision tree models are used by decision makers to make accurate predictions and have been applied in many fields.

Among the various decision tree algorithms, the ID3 algorithm proposed by J. R. Quinlan is the most representative. The C4.5 algorithm, which is widely used today, is an improvement on ID3. Although C4.5 improves on ID3, its computational efficiency is low when the training examples contain continuous attribute values. In view of this problem, scholars at home and abroad have improved the way the optimal threshold of a continuous attribute is computed and eliminated redundant computation of the information gain ratio, and these improvements raise the computational efficiency of the algorithm.

C4.5 has a strong ability to deal with noisy data, whether the training samples contain classification errors or are missing part of their attribute values. However, when the attribute missing rate of the training sample set is high, the number of nodes in the decision tree model built by C4.5 increases and the classification accuracy declines to a certain degree. This paper improves the decision tree generation algorithm and the pruning strategy on the basis of C4.5. During tree generation, if all the attribute values of a subset are unknown, a leaf node labeled "unknown" is returned directly. When pruning the decision tree obtained in this way, two factors are considered in deciding whether a node should be cut off: first, the classification error rate with and without pruning the node; second, the proportion of "unknown" leaf nodes among all the leaf nodes under the node. The decision tree obtained with this pruning strategy has a number of nodes less than or equal to that of the C4.5 model, and it achieves higher classification accuracy on training samples with a high attribute missing rate.

In this paper, the improved algorithm is applied to training sample sets containing both discrete and continuous attributes, and the resulting decision trees are compared with those produced by the traditional C4.5 algorithm.
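As a minimal sketch of the pruning decision described above (not the thesis's actual implementation), the following Python fragment illustrates the two factors: the estimated error rate with and without pruning, and the proportion of "unknown" leaves under a node. The names Node, unknown_ratio, should_prune, and the threshold value are illustrative assumptions; how the two error rates are estimated is left to the caller.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        label: Optional[str] = None            # class label of a leaf ("unknown" is allowed)
        children: List["Node"] = field(default_factory=list)

        def is_leaf(self) -> bool:
            return not self.children

        def leaves(self) -> List["Node"]:
            # Collect all leaf nodes in the subtree rooted at this node.
            if self.is_leaf():
                return [self]
            return [leaf for child in self.children for leaf in child.leaves()]

    def unknown_ratio(node: Node) -> float:
        # Proportion of leaves under this node that are labeled "unknown".
        leaves = node.leaves()
        return sum(leaf.label == "unknown" for leaf in leaves) / len(leaves)

    def should_prune(node: Node, error_if_kept: float, error_if_pruned: float,
                     unknown_threshold: float = 0.5) -> bool:
        # Factor 1: collapsing the subtree to a leaf does not raise the estimated error.
        if error_if_pruned <= error_if_kept:
            return True
        # Factor 2: too many of the subtree's leaves are "unknown" placeholders.
        return unknown_ratio(node) >= unknown_threshold

Under this criterion a subtree whose leaves are dominated by the "unknown" placeholder is replaced by a single leaf, which is consistent with the abstract's claim that the resulting tree can only be as large as, or smaller than, the standard C4.5 tree.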
Keywords/Search Tags: Data mining, Decision tree, C4.5 algorithm, Pruning strategy