| Imbalanced data classification arises in many fields, such as finance, biomedicine, and information security. In practical applications such as credit risk assessment and disease detection, imbalanced data usually contain both categorical and numeric features. Encoding categorical data numerically introduces unreasonable ordering information and assumes that the distances between different categorical values are equal. Characteristics of imbalanced data such as class imbalance and class overlap are the main sources of classification difficulty and major contributors to data complexity. Therefore, for hybrid imbalanced data containing both categorical and numeric features, designing a reasonable data combination mapping method that reduces data complexity, based on an understanding of that complexity, is of great significance for improving classification performance. This thesis studies hybrid imbalanced data, and its main contributions are the following three: (1) Data complexity is a key factor affecting classification performance. Because the complexity of categorical data is difficult to measure directly, this thesis considers the characteristics of features and class labels and proposes a set of complexity measures for hybrid imbalanced data from three perspectives, using the HVDM (Heterogeneous Value Difference Metric) distance measure. These measures effectively address the difficulty of directly measuring the complexity of hybrid imbalanced data with categorical and numeric features. The validity of the proposed measures is verified experimentally, and the experiments lead to the conclusion that the complexity of imbalanced data can be judged from the difference in the complexity measures between the majority class and the minority class. (2) For hybrid imbalanced data with higher complexity, traditional undersampling tends to lose sample information, while oversampling tends to aggravate class overlap, overfitting, and other problems. Based on the characteristics of categorical data, this thesis focuses on class imbalance and class overlap and proposes the ReSC data combination mapping method. By designing corresponding sample combination schemes, ReSC avoids numeric coding of categorical data and reduces the number of overlapping samples between classes, thereby reducing the complexity of hybrid imbalanced data. The rationality and effectiveness of ReSC are validated through theoretical and experimental analysis. (3) In the financial field, credit risk assessment is a hybrid imbalanced problem, and it serves as the application study of this thesis. The ReSC data combination mapping method is used to preprocess the credit data, and the results are analyzed experimentally in terms of data complexity and classification performance. The experiments verify the feasibility of the complexity measures and of the ReSC data combination mapping method in practical application scenarios. In summary, analyzing the complexity of hybrid imbalanced data helps in understanding the data, and the ReSC data combination mapping method reduces data complexity and addresses the difficulty of processing categorical data directly. The research in this thesis has important theoretical and practical significance for the imbalanced classification of hybrid data. |
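
For readers unfamiliar with the HVDM distance named in contribution (1), the following is a minimal sketch of the standard Heterogeneous Value Difference Metric, not of the complexity measures proposed in the thesis, which the abstract does not specify. Categorical attributes are compared through the class-conditional probabilities of their values, so no numeric coding or artificial ordering is needed; numeric attributes are compared through their absolute difference divided by four standard deviations. The function names `fit_hvdm` and `hvdm`, the toy data, the use of the population standard deviation, and the choice to assign a squared distance of 1 to unseen categorical values are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from statistics import pstdev


def fit_hvdm(rows, labels, categorical):
    """Collect the statistics HVDM needs from a labelled sample.

    rows: list of tuples, labels: list of class labels,
    categorical: set of column indices that hold categorical values.
    """
    classes = sorted(set(labels))
    sigma = {}      # numeric column -> standard deviation
    cond_prob = {}  # categorical column -> {value: {class: P(class | value)}}
    for a in range(len(rows[0])):
        column = [r[a] for r in rows]
        if a in categorical:
            value_count = Counter(column)
            joint = defaultdict(Counter)
            for v, c in zip(column, labels):
                joint[v][c] += 1
            cond_prob[a] = {
                v: {c: joint[v][c] / value_count[v] for c in classes}
                for v in value_count
            }
        else:
            # Assumption: population standard deviation of the column.
            sigma[a] = pstdev(column)
    return classes, sigma, cond_prob


def hvdm(x, y, model):
    """HVDM distance between two samples x and y."""
    classes, sigma, cond_prob = model
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if a in cond_prob:
            px, py = cond_prob[a].get(xa), cond_prob[a].get(ya)
            if px is None or py is None:
                # Assumption: an unseen value is treated like a missing one.
                d2 = 1.0
            else:
                # Squared value-difference term over all classes.
                d2 = sum((px[c] - py[c]) ** 2 for c in classes)
        else:
            s = sigma[a]
            d = abs(xa - ya) / (4 * s) if s > 0 else 0.0
            d2 = d * d
        total += d2
    return math.sqrt(total)


# Toy hybrid data: column 0 is categorical, column 1 is numeric.
rows = [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0)]
labels = [0, 0, 1, 1]
model = fit_hvdm(rows, labels, categorical={0})
same_value = hvdm(rows[0], rows[1], model)   # differs only in the numeric column
other_value = hvdm(rows[0], rows[2], model)  # differs in both columns
```

In this toy sample the values `"a"` and `"b"` each occur in only one class, so the categorical term dominates and `other_value` is larger than `same_value`. A complexity measure for hybrid data could plausibly use such a distance wherever a purely numeric measure would use the Euclidean distance, although the exact construction in the thesis may differ.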