Imbalanced Data Classification Based On The Influence Of Training Instances

Posted on: 2021-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Pu
GTID: 2518306050472554
Subject: Applied Mathematics
Abstract/Summary:
For a bi-class problem, an imbalanced data set is one in which the numbers of instances in the two classes differ markedly. Existing classification methods aim to maximize the overall classification accuracy of the model and neglect classification performance on the minority instances. However, in many real-world scenarios the minority instances carry greater mining value and are the objects of interest; in network intrusion detection, for example, it is the intruders engaged in fraudulent and illegal activities that must be discovered. The classification of imbalanced data sets is therefore an active research topic. This thesis approaches the problem from two angles, the influence function and the network classification model, and applies an improved influence function combined with k nearest neighbors, as well as a method based on network structure, to imbalanced data classification. The final classification rule assigns a test instance to the class whose training instances have the greater cumulative influence on it. Simulation results show that the proposed algorithms have clear advantages over traditional classification models. The innovations of this thesis can be summarized as follows:

(1) An imbalanced data classification method based on network structure is proposed to address the tendency of traditional network classifiers to favor the majority instances when applied to imbalanced data. The method distinguishes minority instances from majority instances and redistributes the initial node influence computed by the PageRank algorithm to increase the model's attention to the minority instances. In addition, because traditional network classifiers treat the attributes of instances from different classes equally, the concept of fuzzy entropy is used to compute attribute weights for the instances of each class, and these weights are used in calculating the local efficiency of the nodes and the physical characteristics between instances. Simulation results show that the method improves classification performance on the minority instances to a certain extent while preserving classification accuracy on the majority instances.

(2) An imbalanced data classification method based on the improved influence function and k nearest neighbors is proposed to address two shortcomings of the traditional definition of the influence function: it ignores the distribution characteristics of the training instances, and it uses the same k value for all test instances. In defining the influence function, the method considers not only the distance between the training instance and the test instance but also the class representative ability of the training instance itself. Specifically, the distance from the training instance to the cluster center and the intra-class distribution characteristics of the training instance are first used as the initial class representation of the instance; the concept of confidence is then introduced to analyze how the distribution of instances from other classes affects the instance. Instances in different positions are thereby effectively distinguished, and the true influence of a training instance on a test instance is calculated accurately. When selecting effective neighbors for each test instance, the method makes full use of the distribution characteristics of the test instance itself and the inherent class information of the neighbor instances, and adds a neighbor selection step to the k nearest instances of the test instance; the goal is to find, for each test instance, the nearest neighbors that can genuinely participate in its class decision. Simulation results show that the method achieves better classification performance than traditional methods on imbalanced data classification.
Keywords/Search Tags: Imbalanced Data, Classification, Influence Function, K Nearest Neighbor, Network Structure
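The abstract states the decision rule only in words: a test instance goes to the class whose training instances have the greater cumulative influence on it. The sketch below illustrates that rule under loudly stated assumptions and is not the thesis's actual formulation. The function names (`centroid`, `classify`), the representativeness weight 1 / (1 + distance to the class centroid), and the influence defined as representativeness divided by distance to the test instance are all hypothetical stand-ins; the confidence term and the adaptive neighbor selection described in the abstract are omitted, and a fixed k is used instead.

```python
import math
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def classify(train, labels, x, k=3, eps=1e-9):
    """Assign x to the class with the largest cumulative influence
    among its k nearest training instances (illustrative only)."""
    # Group training instances by class to get one centre per class.
    by_class = defaultdict(list)
    for p, y in zip(train, labels):
        by_class[y].append(p)
    centers = {y: centroid(ps) for y, ps in by_class.items()}

    # Assumed representativeness: instances nearer their own class
    # centre are treated as more typical of that class.
    def representativeness(p, y):
        return 1.0 / (1.0 + math.dist(p, centers[y]))

    # Fixed-k neighbourhood (the thesis selects neighbours adaptively).
    neighbours = sorted(zip(train, labels),
                        key=lambda t: math.dist(t[0], x))[:k]

    # Assumed influence: representativeness divided by distance to x,
    # accumulated per class.
    score = defaultdict(float)
    for p, y in neighbours:
        score[y] += representativeness(p, y) / (math.dist(p, x) + eps)
    return max(score, key=score.get)


# Toy imbalanced data: six majority (0) and two minority (1) instances.
train = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
         (0.0, 0.4), (0.4, 0.0), (3.0, 3.0), (3.2, 3.1)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

print(classify(train, labels, (2.9, 3.0)))  # near the minority cluster
print(classify(train, labels, (0.1, 0.1)))  # near the majority cluster
```

Because each neighbor contributes a weighted influence rather than a single vote, two close minority neighbors can outweigh a more distant majority neighbor even when the minority class is small, which is the intuition the abstract appeals to.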