Research On C4.5 Algorithm Based On Cosine Similarity And Weighted Pruning Strategy

Posted on: 2018-01-15 | Degree: Master | Type: Thesis
Country: China | Candidate: X C Xia | Full Text: PDF
GTID: 2348330533459889 | Subject: Computer technology
Abstract/Summary:
Since the start of the 21st century, the rapid development of database technology and the substantial growth in memory and other hardware capacity have made it possible to collect ever larger volumes of data. Traditional data mining techniques struggle to analyze and process data at this scale effectively, so research on new data mining methods has attracted growing attention. C4.5 is one of the most classical and important data mining algorithms. The traditional C4.5 algorithm, however, suffers from redundant rules, large decision trees, and slow speed. To address these problems, an improved C4.5 decision tree algorithm based on cosine similarity is proposed. First, the information entropy and gain ratio of each attribute are calculated. If, for some attribute, the difference in information entropy between two of its attribute values falls within a small range, the cosine similarity of those two attribute values is calculated. Attribute values whose similarity falls within the threshold are then merged, and the information gain ratio of the merged attribute is recalculated. Finally, the decision tree is built following the traditional C4.5 procedure. General examination data from a hospital are used for simulation. The results show that the proposed algorithm effectively reduces the dimension of the split attributes, the size of the decision tree, and the number of redundant rules, and also improves classification speed.

Although this method achieves the desired results, in practical applications it can cause important attributes to be lost. This thesis therefore proposes a further improved hybrid C4.5 algorithm that combines cosine similarity with a weighted pruning strategy. First, the attributes are ranked by importance according to existing knowledge. The cosine similarity is then calculated according to the degree of importance. Finally, the final decision tree is obtained according to attribute importance. Experimental results show that this algorithm successfully retains the important attributes and resolves the problem of contributing attributes disappearing.
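
The merging step of the first method (compare attribute values by entropy, then by cosine similarity, merge, and recompute the gain ratio) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The abstract does not say which vectors the cosine similarity is computed on, so the sketch assumes each attribute value is represented by its vector of class counts. The function names, the greedy merge order, and the default values of `entropy_eps` and `sim_threshold` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def class_vectors(values, labels):
    """Map each attribute value to its class-count vector (assumed representation)."""
    classes = sorted(set(labels))
    table = defaultdict(Counter)
    for v, y in zip(values, labels):
        table[v][y] += 1
    return {v: [table[v][c] for c in classes] for v in table}


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gain_ratio(values, labels):
    """Standard C4.5 gain ratio of a categorical attribute."""
    n = len(labels)
    vecs = class_vectors(values, labels)
    info = entropy(list(Counter(labels).values()))
    cond = sum(sum(vec) / n * entropy(vec) for vec in vecs.values())
    split_info = entropy([sum(vec) for vec in vecs.values()])
    return (info - cond) / split_info if split_info > 0 else 0.0


def merge_similar_values(values, labels, entropy_eps=0.05, sim_threshold=0.95):
    """Greedily map attribute values onto a representative value.

    Two values are merged when their class entropies differ by at most
    entropy_eps AND the cosine similarity of their class-count vectors is
    at least sim_threshold. Both thresholds are illustrative assumptions.
    """
    vecs = class_vectors(values, labels)
    names = sorted(vecs)
    mapping = {v: v for v in names}
    for i, a in enumerate(names):
        if mapping[a] != a:  # a was already merged into an earlier value
            continue
        for b in names[i + 1:]:
            if mapping[b] != b:
                continue
            close_entropy = abs(entropy(vecs[a]) - entropy(vecs[b])) <= entropy_eps
            if close_entropy and cosine(vecs[a], vecs[b]) >= sim_threshold:
                mapping[b] = a
    return mapping
```

To use it, apply the returned mapping to the attribute column (`merged = [mapping[v] for v in values]`) and call `gain_ratio(merged, labels)` to obtain the recalculated gain ratio before continuing with the usual C4.5 attribute selection. Merging values with near-identical class distributions leaves the information gain almost unchanged while lowering the split information, which is one plausible reason the merged attribute yields a smaller tree.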
Keywords/Search Tags: data mining, C4.5, cosine-similarity, threshold, weighted pruning