Imbalanced Data Classification Based On The Influence Of Training Instances

Posted on:2021-01-14

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Pu

GTID:2518306050472554

Subject:Applied Mathematics

Abstract/Summary:

In a binary classification problem, an imbalanced data set is one in which the two categories contain markedly different numbers of instances. Existing classification methods aim to maximize the overall classification efficiency of the model and ignore the classification performance on the minority instances. However, in a variety of real-world scenarios the minority instances have greater mining value and are the objects that deserve attention. In network intrusion detection, for example, the intruders carrying out fraudulent and illegal activities are the ones that should be discovered. The classification of imbalanced data sets is therefore an active research topic. This thesis approaches the problem from two angles, the influence function and the network classification model, and applies an improved influence function combined with the k nearest neighbor method, as well as a method based on network structure, to imbalanced data classification. The final classification rule assigns a test instance to the class whose training instances have the greater cumulative influence on it. The simulation results show that the proposed algorithms have clear advantages over traditional classification models. The innovations of this thesis can be summarized as follows:

(1) An imbalanced data classification method based on network structure is proposed to address the tendency of traditional network classifiers to favor the majority instances when applied to imbalanced data. The method distinguishes minority instances from majority instances and redistributes the initial node influence calculated by the PageRank algorithm to increase the model's attention to the minority instances. In addition, to address the problem that traditional network classifiers treat the attributes of instances from different classes equally, the concept of fuzzy entropy is used to calculate weights for the instance attributes of each class, and these weights are used to calculate the local efficiency of the nodes and the physical characteristics between the instances. The simulation results show that the method can improve the classification performance on the minority instances to a certain extent while preserving the classification accuracy on the majority instances.

(2) An imbalanced data classification method based on the improved influence function and the k nearest neighbor method is proposed to address two problems in the definition of traditional influence functions: ignoring the distribution characteristics of the training instances, and using the same k value for all test instances. When defining the influence function, the method considers not only the distance between the training instance and the test instance but also the class representative ability of the training instance itself. Specifically, the distance from the training instance to the center of the cluster and the intra-class distribution characteristics of the training instance are first used as the initial class representation of the instance. The concept of confidence is then introduced to analyze the impact of the distribution of other-class instances on the instance. In this way, instances in different positions are effectively distinguished and the true influence of a training instance on a test instance is calculated accurately. When selecting effective neighbors for each test instance, the method makes full use of the distribution characteristics of the test instance itself and the inherent class information of the neighbor instances, and adds a neighbor selection step to the k nearest instances of the test instance. The goal is to find, for each test instance, the nearest neighbors that can genuinely participate in its category decision. The simulation results show that the method achieves better classification performance than traditional methods on imbalanced data classification.
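
The final classification rule described above (assign a test instance to the class with the greater cumulative influence) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the abstract does not give the influence formula, so the sketch assumes a simple stand-in in which a training instance's influence is its class representativeness (here assumed to be 1 / (1 + distance to its class centroid)) divided by its distance to the test instance. It also uses a fixed k instead of the per-instance neighbor selection the thesis proposes, and omits the confidence term. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_centroids(X, y):
    """Return a dict mapping each class label to the mean of its instances."""
    sums = {}
    counts = defaultdict(int)
    for point, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(point)
        sums[label] = [s + v for s, v in zip(sums[label], point)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify_by_cumulative_influence(X, y, query, k=3, eps=1e-9):
    """Assign `query` to the class whose neighbors have the largest summed influence.

    ASSUMPTION: influence = representativeness / distance to the query, where
    representativeness = 1 / (1 + distance to the class centroid). The thesis
    defines its own improved influence function, which is not reproduced here.
    """
    centroids = class_centroids(X, y)
    # Fixed-k neighbor selection (the thesis adapts the neighbors per test instance).
    neighbors = sorted(zip(X, y), key=lambda pair: euclidean(pair[0], query))[:k]
    totals = defaultdict(float)
    for point, label in neighbors:
        representativeness = 1.0 / (1.0 + euclidean(point, centroids[label]))
        totals[label] += representativeness / (euclidean(point, query) + eps)
    return max(totals, key=totals.get)


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [2, 2],
           [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 0, 0, 0, 1, 1, 1]

print(classify_by_cumulative_influence(X_train, y_train, [4.5, 4.5], k=3))
```

Summing influence rather than counting votes means that a few close, representative minority instances can outweigh a larger number of distant or atypical majority instances, which is the intuition behind using cumulative influence on imbalanced data.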

Keywords/Search Tags:

Imbalanced Data, Classification, Influence Function, K Nearest Neighbor, Network Structure

Related items

1	Research On Improved K-nearest Neighbor Method For Imbalanced Data Set Classification
2	Study On Generalized Nearest Neighbor Pattern Classification
3	Research On Classification Algorithm For Imbalanced Data
4	Imbalanced Classification Methods For Complex Distribution Characteristics
5	Granular Computing-oriented Dynamic Neighborhood Imbalanced Data Classification Algorithm
6	Random K-Nearest Neighbor Algorithm With Application To Bankruptcy Prediction
7	Research On Classification Technology For Imbalanced Data Sets
8	Research Of Nearest Neighbor Classification Algorithm Based On Sample Selection
9	Research On Several Pattern Classification Methods Based On K-nearest Neighbor Criterion
10	Improvement Of KNN Algorithm Based On Weighted Data Partition And Imbalanced Data Set