Font Size: a A A

KNN Approaches For Rare Category Data Mining

Posted on:2021-10-17Degree:MasterType:Thesis
Country:ChinaCandidate:Y LiFull Text:PDF
GTID:2518306194975889Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid growth of data volume,people always have to face a variety of data sets.Among them,the unbalanced data sets occupy the vast majority.The unbalanced data sets mean that the categories in the dataset have different data samples.The categories containing the vast majority of data samples are called the majority categories,while the opposite is the rare categories containing a very small number of data samples.However,people tend to be more interested in data samples in rare categories rather than majority categories.Because the behavior of these rare data samples is more research-oriented,such as a small amount of illegal transactions hidden in the massive online transaction data,or a small amount of malicious attacks in many network access records.Therefore,the research on mining these rare data samples has high research value and practical significance.According to the existing literature,the problem of rare category data mining can be divided into two problems.The first problem is called rare category detection problem,which is defined as finding at least one data sample for each rare category to prove the existence of the rare category.According to whether users have prior knowledge of the dataset,the first kind of rare category data mining problem can be divided into rare category detection problem based on prior knowledge and priori-free rare category detection problem.The second problem is the extension of the first one,which is to find all data samples for each rare category,so as to study the properties of the rare categories better.The second kind of the problem is often called rare category identification problem.According to the different input data,it is often divided into rare category classification problem or rare category exploration problem.Aiming at the two problems mentioned above,this paper studies the problem of rare category detection and the problem of rare category identification,and gives the corresponding algorithms.The main work of this paper can be summarized as follows:(1)In order to solve the problem of rare category detection,we propose a rare category detection algorithm based on k-nearest-neighbor graph.By constructing k-nearest-neighbor graph on the original data set,the algorithm can detect the abrupt change of data sample distribution in the small area,and then select data sample as the rare category samples.(2)Aiming at solving the problem of rare category detection,we propose a rare category detection algorithm based on the centroid k-nearest neighbor.Compared with the traditional nearest neighbor relationship,the centroid k-nearest neighbor can more comprehensively reflect the data distribution around the target data sample,and detect the rare category more accurately.(3)Aiming at the problem of rare category identification,we propose a rare category identification algorithm based on local exploration.This algorithm can find all the data samples of the target rare category by continuously explore the local neighborhood of the target data sample.Compared with most of the current algorithms,this method does not need enough training sets and has a good effect on rare categories of arbitrary shape.Experiments based on real data sets show that the algorithm can quickly and accurately find all the data samples of the target rare category.
Keywords/Search Tags:Rare category detection, k-nearest neighborhood relationship, centroid k-neighborhood, local exploration
PDF Full Text Request
Related items