Research And Application On CFS-HDRF Classification Algorithm For Imbalanced Data Set

Posted on: 2021-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Xu
GTID: 2428330629988939
Subject: Engineering
Abstract/Summary:
Imbalanced datasets are an important type of data in data mining research, and they have received widespread attention in application areas such as customer churn, credit evaluation, and anomaly detection. Random forest is an ensemble learning classification method that performs well on imbalanced classification problems. However, random forest uses the Gini index as the feature selection and node splitting criterion of its base classifier, the decision tree. This criterion is sensitive to skew in the class distribution, which degrades classification of imbalanced data. To address this problem, this thesis covers the following three aspects.
First, the Hellinger distance and the Hellinger distance decision tree algorithm are combined to discuss and verify the algorithm's insensitivity to class imbalance, and the performance and evaluation indices of the Hellinger random forest algorithm are validated through experiments. Because the Gini index is skew-sensitive while the Hellinger distance is not sensitive to class imbalance, the Hellinger distance is adopted as the feature selection and node splitting criterion of the decision tree, and accuracy and the Kappa statistic are used to examine its effect on the classification of imbalanced datasets. The experiments show that the Hellinger random forest classifies imbalanced datasets well, but that it does not handle feature imbalance or feature redundancy and that its evaluation indices are not well suited to the task.
Second, a Hellinger random forest algorithm based on association rule feature selection is constructed. Class imbalance also causes the features of the minority-class samples to be imbalanced, so the model tends to overfit the majority class. To address the missing treatment of feature imbalance and the unsuitable evaluation indices found in the Hellinger random forest experiments, association rule feature selection is adopted to deal with feature imbalance. It removes redundant features, reduces the number of Hellinger distance computations over features, and can reduce the number of nodes and the height of each tree. The resulting algorithm is evaluated with precision, recall, and F1 score. Experiments show that the Hellinger random forest algorithm based on association rule feature selection achieves good classification results.
Third, a prototype system for the performance evaluation of software engineering learning teams is designed on the basis of the CFS-HDRF algorithm. Because the existing dataset for software engineering learning team performance evaluation is imbalanced, this thesis treats the task as an imbalanced classification problem and applies the CFS-HDRF algorithm to the design of the prototype system. Analysis of the system's results shows that, compared with existing RF algorithms, the prototype system based on the CFS-HDRF algorithm performs better.
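The skew insensitivity that the abstract attributes to the Hellinger distance can be illustrated with a small sketch. It assumes a two-class problem and a binary split, and uses the usual Hellinger split value from the decision tree literature, the square root of the sum over branches of (sqrt(P(branch | positive)) - sqrt(P(branch | negative)))^2. The function names and the sample counts are hypothetical and are not taken from the thesis; the Gini gain is included only for comparison.

```python
import math


def hellinger_split_value(pos_left, pos_right, neg_left, neg_right):
    """Hellinger distance between the class-conditional branch distributions.

    Each class is normalised by its own total, so the class prior
    (how rare the positive class is) never enters the value.
    """
    pos_total = pos_left + pos_right
    neg_total = neg_left + neg_right
    if pos_total == 0 or neg_total == 0:
        return 0.0  # a split cannot separate classes if one class is absent
    total = 0.0
    for pos, neg in ((pos_left, neg_left), (pos_right, neg_right)):
        total += (math.sqrt(pos / pos_total) - math.sqrt(neg / neg_total)) ** 2
    return math.sqrt(total)


def gini(pos, neg):
    """Gini impurity of a node holding pos positive and neg negative samples."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def gini_gain(pos_left, pos_right, neg_left, neg_right):
    """Impurity decrease of the same binary split under the Gini criterion."""
    n_left = pos_left + neg_left
    n_right = pos_right + neg_right
    n = n_left + n_right
    parent = gini(pos_left + pos_right, neg_left + neg_right)
    children = (n_left / n) * gini(pos_left, neg_left) \
        + (n_right / n) * gini(pos_right, neg_right)
    return parent - children


# Hypothetical counts: the same split, first on balanced data, then with
# the negative class made ten times larger in the same left/right proportions.
balanced = (40, 10, 10, 40)
skewed = (40, 10, 100, 400)

print("Hellinger:", hellinger_split_value(*balanced), hellinger_split_value(*skewed))
print("Gini gain:", gini_gain(*balanced), gini_gain(*skewed))
```

Under these assumed counts the Hellinger value is the same for both cases, because only the within-class proportions matter, whereas the Gini gain shrinks as the negative class grows. This is one way to read the abstract's claim that the Gini criterion is skew-sensitive and the Hellinger criterion is not.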
Keywords/Search Tags: Imbalanced datasets, Hellinger distance, Association rule feature selection, Random forest, Software engineering learning team performance evaluation