Optimization And Application Of C4.5 Decision Tree Algorithm

Posted on: 2018-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: X X Huang
Full Text: PDF
GTID: 2348330512959274
Subject: Computer Science and Technology
Abstract/Summary:
The C4.5 algorithm, one of the top ten classic algorithms in data mining, is a classification algorithm used for prediction. Optimized variants and applications of C4.5 appear in many fields, such as business decision-making, condition forecasting in medicine, and gene identification in biology. To improve the correctness of feature selection and the data-processing ability of C4.5, the most popular improvements combine it with particle swarm optimization (PSO) or fuzzy algorithms. This paper focuses on optimizing three problems of C4.5: logarithmic calculation, attribute correlation, and redundant computation. The improved algorithms are then applied to forecasting students' English exam results.

C4.5 requires a large number of logarithm operations and suffers from interference caused by correlation between attributes. To address these problems, a C4.5 algorithm based on the average GINI index between conditional attributes (GC4.5) is proposed. First, the Taylor series and the equivalent infinitesimal principle are used to simplify the information gain ratio formula, so that addition, subtraction, multiplication, and division replace the logarithm operations, saving the time spent calling the logarithm function. Second, the average GINI index between conditional attributes is introduced into the simplified information gain ratio formula to deal with the error caused by correlation between conditional attributes. GC4.5 was evaluated on a large number of UCI data sets. The experimental results show that GC4.5 performs better than some existing C4.5 optimization algorithms.

C4.5 is also affected by irrelevant attributes and by correlation between attributes. To solve these problems, an improved C4.5 algorithm based on attribute dependency calculation and PCA (RPC4.5) is proposed. First, the dependency between each condition attribute and the class attribute is calculated using the attribute dependency formula, and condition attributes with very small dependency are deleted to avoid unrelated calculation. Second, the reduced data set is processed using the compression principle of PCA. After PCA, the attributes of the data set are combinations of principal components that are independent of each other, which removes the influence of correlation between attributes. Tests on a large number of UCI data sets show that, compared with some existing C4.5 optimization algorithms, RPC4.5 improves accuracy significantly and has certain advantages in modeling speed.

Performance prediction is a current research focus in data mining. C4.5 is simple and easy to understand, has a short modeling time, and achieves relatively high classification accuracy, which makes it a first choice for performance prediction. In this paper, GC4.5 and RPC4.5 are used to predict school English exam results. The application experiments are carried out with the Java development platform Eclipse and the data mining tool WEKA. The results show that, compared with C4.5, GC4.5 and RPC4.5 give more accurate classification predictions and shorter modeling times. The improvements to C4.5 are therefore feasible and of practical value.
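
The abstract does not give the simplified gain ratio formula, so the following Python sketch only illustrates the general idea of replacing logarithms with arithmetic. It assumes the first-order Taylor expansion ln(p) ≈ p - 1 around p = 1, under which the entropy term -p ln(p) becomes p(1 - p), the same expression as the Gini impurity. The function names (entropy_nats, entropy_taylor, gain_ratio) are hypothetical and are not taken from the thesis; GC4.5 itself may use a different expansion and additionally includes the average GINI index between conditional attributes, which is not modeled here.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each distinct value in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def entropy_nats(labels):
    """Exact Shannon entropy in nats; needs one log call per class."""
    return -sum(p * math.log(p) for p in class_probabilities(labels))


def entropy_taylor(labels):
    """Log-free approximation (an assumption, not the thesis formula).

    Uses ln(p) ~ p - 1, so -p * ln(p) ~ p * (1 - p).
    Only multiplication and subtraction are required.
    """
    return sum(p * (1.0 - p) for p in class_probabilities(labels))


def gain_ratio(values, labels, impurity):
    """Gain ratio of one categorical attribute, with a pluggable impurity."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted impurity of the class labels after splitting on the attribute.
    conditional = sum(len(g) / total * impurity(g) for g in groups.values())
    gain = impurity(labels) - conditional
    # Split information: impurity of the attribute's own value distribution.
    split_info = impurity(values)
    return gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    attribute = ["a", "a", "b", "b"]
    outcome = ["pass", "pass", "fail", "fail"]
    print(gain_ratio(attribute, outcome, entropy_nats))
    print(gain_ratio(attribute, outcome, entropy_taylor))
```

Passing the impurity function as a parameter keeps the exact and approximate versions directly comparable on the same data. The two generally give different numeric values, so what matters for tree construction is whether they rank candidate attributes in the same order.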
Keywords/Search Tags: C4.5 algorithm, Taylor series, GINI index, dependency for attributes, PCA