KNN Approaches For Rare Category Data Mining

Posted on: 2021-10-17

Degree: Master

Type: Thesis

Country: China

Candidate: Y Li

Full Text: PDF

GTID: 2518306194975889

Subject: Computer software and theory

Abstract/Summary:

With the rapid growth of data volume, people have to deal with a wide variety of data sets, most of which are unbalanced. In an unbalanced data set, the categories contain very different numbers of data samples. The categories containing the vast majority of data samples are called majority categories, while those containing only a very small number of data samples are called rare categories. People are often more interested in the samples in rare categories than in those in majority categories, because the behavior behind these rare samples is more worth studying. Examples include a small number of illegal transactions hidden in massive online transaction data, or a small number of malicious attacks among many network access records. Research on mining these rare data samples therefore has high research value and practical significance.

According to the existing literature, rare category data mining can be divided into two problems. The first, rare category detection, is defined as finding at least one data sample for each rare category to prove that the rare category exists. Depending on whether users have prior knowledge of the data set, it can be further divided into rare category detection based on prior knowledge and prior-free rare category detection. The second problem extends the first: it aims to find all data samples of each rare category, so that the properties of the rare categories can be studied better. This problem is often called rare category identification and, depending on the input data, is divided into rare category classification and rare category exploration.

This thesis studies both rare category detection and rare category identification and gives corresponding algorithms. The main work can be summarized as follows:

(1) To solve the rare category detection problem, we propose a rare category detection algorithm based on the k-nearest-neighbor graph. By constructing a k-nearest-neighbor graph on the original data set, the algorithm detects abrupt changes in the distribution of data samples within a small area and then selects data samples there as rare category samples.

(2) Also for the rare category detection problem, we propose a rare category detection algorithm based on the centroid k-nearest neighbor. Compared with the traditional nearest-neighbor relationship, the centroid k-nearest neighbor reflects the data distribution around the target data sample more comprehensively and detects rare categories more accurately.

(3) For the rare category identification problem, we propose a rare category identification algorithm based on local exploration. The algorithm finds all data samples of the target rare category by continuously exploring the local neighborhood of the target data sample. Compared with most current algorithms, this method does not require a sufficient amount of training data and works well on rare categories of arbitrary shape. Experiments on real data sets show that the algorithm can quickly and accurately find all the data samples of the target rare category.
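
As a rough illustration of the idea behind contribution (1), the following Python sketch builds a k-nearest-neighbor graph and looks for the edge across which the local density changes most sharply. It is not the algorithm proposed in the thesis, whose details are not given in this abstract. The scoring rule (the ratio of k-th-neighbor distances at the two ends of an edge), the function names knn_graph and rare_candidate, and the toy data are all assumptions made for illustration only.

```python
import math


def knn_graph(points, k):
    """Return, for each point, its k nearest neighbors as (distance, index) pairs."""
    graph = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph.append(dists[:k])
    return graph


def rare_candidate(points, k):
    """Pick the point on the dense side of the sharpest density jump.

    Hypothetical scoring rule: for every edge i -> j of the kNN graph,
    compare the k-th-neighbor distance at i with the one at j.
    Assumes no point has k coincident duplicates (k-distance > 0).
    """
    graph = knn_graph(points, k)
    # Distance to the k-th neighbor: small in dense regions, large in sparse ones.
    kdist = [neighbors[-1][0] for neighbors in graph]
    best_ratio, best_index = 0.0, None
    for i, neighbors in enumerate(graph):
        for _, j in neighbors:
            ratio = kdist[i] / kdist[j]  # > 1 when j lies in a denser area than i
            if ratio > best_ratio:
                best_ratio, best_index = ratio, j
    return best_index, best_ratio


# Toy data: a 10 x 10 grid as the majority category, plus a tight 5-point cluster.
majority = [(float(x), float(y)) for x in range(10) for y in range(10)]
rare = [(4.5, 4.5), (4.55, 4.5), (4.45, 4.5), (4.5, 4.55), (4.5, 4.45)]
points = majority + rare

index, ratio = rare_candidate(points, k=3)
is_rare = index >= len(majority)  # True if the candidate comes from the tight cluster
print(points[index], round(ratio, 1), is_rare)
```

On this toy data the sharpest jump occurs between a grid point bordering the compact cluster and a point inside it, so the selected candidate belongs to the rare cluster. The brute-force neighbor search is quadratic in the number of points and is only meant for small examples.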

Keywords/Search Tags:

Rare category detection, k-nearest neighborhood relationship, centroid k-neighborhood, local exploration

Related items

1	Nearest Neighborhood-Based Rare Category Mining
2	Research On Outlier Detection Method Based On Nearest Neighborhood
3	Research On Classification Learning Based On Rough Sets
4	Study On Non-parametric Clustering Based On Natural Nearest Neighborhood
5	Study On Outlier Detection Based On K-nearest Neighborhood MST
6	Research On PSO Algorithm Based On Self-adaptive Neighborhood Explored And Population Centroid Learning Mechanism
7	Research On Heuristic Attribute Reduction Algorithm For Neighbourhood Rough Set
8	Research On Outlier Detection Based On Neighborhood Rough Sets
9	Local domains: Neighborhood planning and the interests of citie
10	Research On Outlier Mining Algorithms Based On Neighborhood Relation