An Improved C4.5 Algorithm And Application

Posted on: 2014-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Liu
Full Text: PDF
GTID: 2268330401483785
Subject: Software engineering
Abstract/Summary:
With the continuous development of science and technology, there is an urgent need for techniques that extract useful information from vast amounts of data, and data mining has become one of the most popular information technologies. It is widely used and plays a very important role. C4.5 is one of the most classic of the ten classic data mining algorithms. It is a decision tree algorithm that presents its classification rules in the form of a tree. C4.5 improves on the ID3 algorithm by using the information gain ratio instead of information gain as the criterion for selecting the splitting attribute. This overcomes the bias toward attributes with many values that arises when attributes are selected by information gain, and C4.5 can also discretize continuous attributes. The most important features of C4.5 are that the rules it produces are easy to understand, that building the tree requires no expert knowledge of the domain being mined, and that the resulting classifier is fast and accurate. C4.5 is now widely applied in economics, industry, medicine, agriculture, and other fields, so research on the algorithm is of considerable importance.

C4.5 nevertheless has several shortcomings. In particular, redundancy in the data can make the complexity of the algorithm too high. This thesis improves C4.5 in this respect and names the result the R-C4.5 algorithm. The specific improvements are as follows. Compute the information entropy of each element (value) of every attribute, and compare the entropies of the elements within the same attribute. If two entropy values are close, compute the similarity of the corresponding sets of elements. If the similarity coefficient is high, the two elements are of the same or a similar nature, and they are merged to form a new element. The similarity is computed with an improved Jaccard coefficient. The aim of this change is to compare two collections not simply by the number of elements they have in common but by the proportions of their elements.

The improved algorithm strengthens the processing mechanism of C4.5. Reduction based on attribute information entropy removes redundant attributes, which lowers the complexity of the algorithm and greatly improves its accuracy. This thesis improves not only the C4.5 algorithm but also the calculation of the Jaccard coefficient for the similarity of collections. The similarity is no longer the ratio of the number of identical elements in the collections; it is instead the ratio of the proportions of the elements in each collection. The purpose of this improvement is to avoid errors of judgment caused by differences in the total number of elements selected.
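As a rough illustration of the two ideas the abstract relies on, the following Python sketch computes the C4.5 gain ratio for a categorical attribute and contrasts a count-based Jaccard coefficient with a proportion-based one. The abstract does not state the exact formula of the improved Jaccard coefficient, so `proportional_jaccard` is only an assumption (a weighted Jaccard over relative class frequencies), and all function names here are hypothetical rather than taken from the thesis.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete items."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """C4.5 gain ratio of a categorical attribute: information gain / split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


def count_jaccard(a, b):
    """Count-based (multiset) Jaccard: sum of min counts / sum of max counts."""
    ca, cb = Counter(a), Counter(b)
    keys = set(ca) | set(cb)
    den = sum(max(ca[k], cb[k]) for k in keys)
    return sum(min(ca[k], cb[k]) for k in keys) / den if den else 0.0


def proportional_jaccard(a, b):
    """Assumed proportion-based variant: the same ratio, computed on relative frequencies."""
    if not a or not b:
        return 0.0
    ca, cb = Counter(a), Counter(b)
    keys = set(ca) | set(cb)
    num = sum(min(ca[k] / len(a), cb[k] / len(b)) for k in keys)
    den = sum(max(ca[k] / len(a), cb[k] / len(b)) for k in keys)
    return num / den


# Toy data: class labels observed under two values of one attribute.
small = ["y", "y", "n"]
large = ["y", "y", "y", "y", "n", "n"]
print(count_jaccard(small, large))
print(proportional_jaccard(small, large))
print(gain_ratio(["a", "a", "b", "b"], ["y", "y", "n", "n"]))
```

In this toy case the two collections have identical class proportions but different sizes, so the count-based coefficient is 0.5 while the proportion-based one is 1.0. That is the kind of size-induced misjudgment the abstract says the improvement is meant to avoid, although the thesis's actual formula may differ from this sketch.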
Keywords/Search Tags: C4.5 Algorithm