Study On C4.5 Algorithm Optimization Combined With K-means

Posted on:2021-12-03

Degree:Master

Type:Thesis

Country:China

Candidate:X F Xu

Full Text:PDF

GTID:2518306461465434

Subject:Master of Engineering

Abstract/Summary:

In the era of big data, every field produces huge amounts of data. After preliminary processing, data mining techniques can analyze these data in a specific direction and extract features and rules useful to a given field, which can guide industrial production and bring large profits. The C4.5 decision tree algorithm is an important branch of data mining, so studying and improving it has practical significance.

C4.5 incurs an excessive time cost when it calculates the information gain ratio of a continuous attribute. To address this, the thesis proposes using the k-means algorithm to first compute cluster centers over the values of the continuous attribute, and then calculating the information gain ratio of that attribute from the resulting cluster centers. Because k-means selects its initial cluster centers at random, the final cluster centers are unstable and too many iterations are needed, so two improved algorithms are proposed. The first is the Average-K-means algorithm, which sorts all values of the continuous attribute, divides the range of the sorted sequence equally by K, and takes the K division points as the initial cluster centers. The second is the Density-K-means algorithm, which sorts all values of the continuous attribute, finds the high-density regions of the sorted sequence, and takes the average value of each such region as an initial cluster center.

Combining these two algorithms with C4.5 yields a further improved algorithm, Density-Average-K-means-C4.5. When calculating the information gain ratio of a continuous attribute, it uses the Density-K-means algorithm to compute the cluster centers when the sample size is large and the Average-K-means algorithm otherwise. The computed cluster centers then replace all values of the attribute in the calculation of its information gain ratio. Experimental results on two UCI data sets, comparing the original and the improved C4.5 algorithms, show that the time required to build the decision tree is greatly reduced while accuracy remains similar.
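The equal-division idea described in the abstract can be illustrated with a short Python sketch. It is not the thesis's implementation, and it covers only the Average-K-means variant, not the density-based one. All function names (`equal_division_centers`, `kmeans_1d`, `gain_ratio`, `best_threshold`) are hypothetical. Two points are assumptions, because the abstract does not spell them out: the initial centers are taken as K equally spaced interior points of the value range, and the candidate split thresholds are the midpoints between adjacent cluster centers.

```python
import math
from collections import Counter


def equal_division_centers(values, k):
    """Assumed reading of Average-K-means: k equally spaced interior points."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (k + 1)
    return [lo + step * (i + 1) for i in range(k)]


def kmeans_1d(values, centers, max_iter=100):
    """Plain 1-D k-means (Lloyd iterations) starting from the given centers."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    conditional = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - conditional
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


def best_threshold(values, labels, k=3):
    """Evaluate only k - 1 thresholds (assumed: midpoints between centers)."""
    centers = kmeans_1d(values, equal_division_centers(values, k))
    candidates = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t), default=None)


if __name__ == "__main__":
    xs = [1.0, 1.2, 1.1, 5.0, 5.2, 5.1, 9.0, 9.1, 9.2]
    ys = ["a", "a", "a", "b", "b", "b", "b", "b", "b"]
    print(best_threshold(xs, ys, k=3))
```

The intended saving in this sketch is that the gain ratio is evaluated at only K - 1 candidate thresholds instead of at every boundary between adjacent distinct values, which is what standard C4.5 does for a continuous attribute.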

Keywords/Search Tags:

C4.5 algorithm, K-means algorithm, Decision tree, Continuous property

Related items

1	K-means Based On Binary And SVM Decision Tree Algorithm Of Data Mining Research
2	The Study Of Trading Behavior Of Investors Based On K-means Algorithm And Decision Tree Model
3	Research On Telecom LTE Users Churn Algorithm Based On Data Mining
4	Data Mining Technology In Human Resource Management
5	Research On Improving Of Decision Tree ID3 Algorithm
6	Improvement And Research Of C4.5 Algorithm Based On K-means
7	Information-gain Based Quantization Algorithm And Its Application In Decision Tree Study
8	The Decision Tree Algorithm And Its Application On Employment Of Undergraduate Students
9	CRM Research Based On The Decision Tree Classification Algorithm
10	Decision Tree Algorithm And Application