Research On Random Forest Similarity Algorithm

Posted on: 2019-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: C Ma
Full Text: PDF
GTID: 2428330566989338
Subject: Engineering
Abstract/Summary:
In machine learning, random forest is an important and widely used data mining method. It offers high classification performance, has few parameters to tune, is fast and efficient to compute, is resistant to overfitting, and tolerates noise well. Because of this, random forest has been applied successfully in many fields and has attracted widespread attention. Although scholars have studied random forest extensively and achieved many notable results, the method still has limitations and leaves room for improvement.

First, building on a study of the existing methods for calculating sample similarity in a random forest, two improved methods are proposed: one based on feature importance, and one based on shared attributes in the decision tree. The former ties the similarity between two samples that fall in the same leaf node to the position of that leaf node. The latter covers the case where two samples fall in different leaf nodes whose class labels are the same, and ties their similarity to the number of identical attributes in the decision tree.

Second, to address the weakness of random forest on imbalanced data and the marginalization problem of the SMOTE algorithm when it selects new negative samples, the KMS_SMOTE algorithm is proposed. KMS_SMOTE first uses the K-Means algorithm to divide the original negative samples into two clusters and computes the center of each. Starting from these two centers, it then selects new negative samples, so that the selected samples converge toward the center of the original negative class. Finally, it applies SMOTE to the new negative samples to obtain the new data set. This method effectively addresses the defect of SMOTE and improves the classification performance of random forest.

Finally, the improved similarity calculation methods and the KMS_SMOTE algorithm are each tested on data sets from the UCI machine learning repository, and the experiments verify the validity of both.
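The abstract does not spell out the existing similarity calculation it starts from. A common baseline is the standard random forest proximity: the fraction of trees in which two samples reach the same leaf. The sketch below illustrates only that baseline, not either of the thesis's improved methods. The function name `proximity_matrix`, the `leaf_ids` layout, and the example data are assumptions made for illustration. In practice, leaf indices of this shape can be obtained from a fitted forest, for example with scikit-learn's `RandomForestClassifier.apply`.

```python
from typing import List, Sequence


def proximity_matrix(leaf_ids: Sequence[Sequence[int]]) -> List[List[float]]:
    """Baseline random forest proximity (illustrative, not the thesis's method).

    leaf_ids[i][t] is the index of the leaf that sample i reaches in tree t.
    The proximity of samples i and j is the fraction of trees in which
    both samples reach the same leaf.
    """
    n_samples = len(leaf_ids)
    if n_samples == 0:
        return []
    n_trees = len(leaf_ids[0])
    if n_trees == 0:
        raise ValueError("each sample needs at least one tree")

    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        prox[i][i] = 1.0  # a sample always shares every leaf with itself
        for j in range(i + 1, n_samples):
            # Count the trees where both samples land in the same leaf.
            shared = sum(1 for a, b in zip(leaf_ids[i], leaf_ids[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Hypothetical leaf assignments: 3 samples (rows) across 4 trees (columns).
example_leaves = [
    [0, 2, 1, 3],
    [0, 2, 4, 3],
    [5, 1, 4, 0],
]
example_prox = proximity_matrix(example_leaves)
```

Under this baseline, two samples in different leaves of a tree contribute nothing for that tree, and every shared leaf counts equally. Those are the two points the proposed methods revisit: weighting a shared leaf by its position, and giving credit to samples in different leaves with the same class label.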
Keywords/Search Tags: random forest, sample similarity, imbalanced data, SMOTE, K-Means