Study On C4.5 Algorithm Optimization Combined With K-means

Posted on: 2021-12-03; Degree: Master; Type: Thesis
Country: China; Candidate: X F Xu; Full Text: PDF
GTID: 2518306461465434; Subject: Master of Engineering
Abstract/Summary:
With the advent of the big data era, every field produces huge amounts of data. After preliminary processing, these data can be analyzed in a specific direction with data mining techniques to extract features and rules useful to a particular field, which can guide industrial production and bring large profits. The C4.5 decision tree algorithm is an important branch of data mining, so studying and improving it has practical significance. To address the excessive time cost of computing the information gain ratio of continuous attributes in C4.5, this thesis proposes first using the k-means algorithm to compute cluster centers over the values of a continuous attribute, and then computing the attribute's information gain ratio from those cluster centers. Because k-means selects its initial cluster centers at random, the final cluster centers are unstable and too many iterations are needed; two improved algorithms are proposed to address this. The first is the Average-K-means algorithm, which sorts all values of the continuous attribute into a sequence, divides the range of the sequence into K equal parts, and takes the K division points as the initial cluster centers. The second is the Density-K-means algorithm, which sorts all values of the continuous attribute into a sequence, finds the high-density regions of the sequence, and takes the average value of each such region as an initial cluster center. Combining these two improved algorithms with C4.5 yields a further improved C4.5 algorithm, the Density-Average-K-means-C4.5 algorithm. When computing the information gain ratio of a continuous attribute, this algorithm uses the Density-K-means algorithm to compute the cluster centers when the sample size is large, and the Average-K-means algorithm otherwise. The computed cluster centers replace all values of the attribute in the calculation of its information gain ratio. Experiments on two UCI data sets show that, compared with the original C4.5 algorithm, the improved algorithm greatly reduces the time required to build the decision tree while achieving similar accuracy.
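The idea of scoring a continuous attribute from a few cluster centers instead of from every distinct value can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several details are assumptions that the abstract does not fix: the initial centers are taken as the midpoints of K equal-width intervals over the value range, the candidate split thresholds are the midpoints between adjacent final centers, and all function names (`average_init`, `kmeans_1d`, `gain_ratio`, `best_split`) are hypothetical. The Density-K-means variant is not shown because the abstract does not specify how high-density regions are detected.

```python
import math
from collections import Counter


def average_init(values, k):
    """Assumed Average-style initialisation: midpoints of k equal-width
    intervals spanning the sorted value range (one reading of the abstract)."""
    ordered = sorted(values)
    lo, hi = ordered[0], ordered[-1]
    width = (hi - lo) / k
    return [lo + width * (i + 0.5) for i in range(k)]


def kmeans_1d(values, centers, max_iter=100):
    """Plain Lloyd-style k-means on one-dimensional data."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center when a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Information gain ratio of the binary split value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    p_left, p_right = len(left) / len(labels), len(right) / len(labels)
    gain = entropy(labels) - p_left * entropy(left) - p_right * entropy(right)
    split_info = -(p_left * math.log2(p_left) + p_right * math.log2(p_right))
    return gain / split_info


def best_split(values, labels, k):
    """Score only k - 1 thresholds derived from the cluster centers
    (assumption: midpoints between adjacent centers)."""
    if k < 2:
        raise ValueError("k must be at least 2 to form a threshold")
    centers = kmeans_1d(values, average_init(values, k))
    candidates = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    scored = [(t, gain_ratio(values, labels, t)) for t in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    values = [1.0, 1.2, 1.1, 5.0, 5.2, 5.1, 9.0, 9.1, 9.2]
    labels = ["a", "a", "a", "b", "b", "b", "b", "b", "b"]
    threshold, ratio = best_split(values, labels, k=3)
    print(threshold, ratio)
```

With K centers, only K - 1 thresholds are evaluated per continuous attribute, rather than one threshold per pair of adjacent distinct values as in standard C4.5. Under the assumptions above, this is where the reduction in tree-building time would come from.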
Keywords/Search Tags: C4.5 algorithm, K-means algorithm, Decision tree, Continuous attribute