Research on Abnormal Data Detection Algorithms with Adaptive K-Nearest Neighbor

Posted on: 2021-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: R X Fan
GTID: 2428330620463307
Subject: Computer application technology
Abstract/Summary:
In recent years, the rapid development and widespread application of the Internet have brought humanity into the era of big data, and massive amounts of data are generated continuously. However, because data sources are diverse and complex, collected data inevitably contain some abnormal samples. These samples may be noise that affects modeling and decision-making and needs to be filtered; they may be occasional occurrences that do not affect decision-making and can be ignored; or they may be samples of a new category that require special attention and should be detected and identified for subsequent processing. Abnormal data manifest differently under different circumstances. This thesis focuses on two types of abnormal data: outliers and label noise. Outliers are objects that appear inconsistent with most of the data in a data set and may carry potential value; outlier detection aims to find objects with exceptional patterns hidden in the data set. Label noise refers to errors in the observed labels of data, arising for various reasons, which increase the complexity of a classification model and reduce its classification accuracy.

At present, abnormal data detection algorithms based on nearest neighbors or density are widely used. These algorithms share a common problem: they are sensitive to the neighbor parameter k. In most existing methods, k is set manually, and the same value is used for every sample in the data set. If k could be set adaptively for each sample according to the distribution characteristics of the data set, abnormal data could be detected more effectively. This thesis studies this problem. The main contributions are:

(1) A personalized k-nearest neighbor (PKNN) outlier detection algorithm is proposed for unlabeled or single-class data sets. Unlike existing methods, PKNN determines k automatically from the distribution characteristics of the data instead of relying on manual assignment, so different samples may have different values of k. In addition, PKNN uses an improved average distance as the discriminant measure for outliers, and it detects outliers well even when the density varies across the data set.

(2) A label noise filtering algorithm with personalized k-nearest neighbor (PKNN-NF) is proposed for binary classification data sets. Positive and negative samples are considered separately, which transforms label noise detection into outlier detection on two single-class data sets; k is set in the same way as in PKNN. PKNN-NF uses a noise factor, which measures how likely a sample is to be label noise, to divide all samples into core and non-core samples. Non-core samples are taken as label noise candidates, and noise is then identified and filtered by combining this with the labels of each candidate's nearest neighbors.

In summary, this thesis proposes an adaptive k-nearest neighbor approach for outliers and label noise, develops an outlier detection algorithm and a label noise filtering algorithm on that basis, and verifies the effectiveness of both through experiments. The results may have some significance and practical value for the study of abnormal data detection.
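To make the contrast between a fixed k and a per-sample k concrete, here is a minimal Python sketch. It is not the PKNN algorithm: the abstract does not state how PKNN chooses k or how its improved average distance is defined. The per-sample rule below (cut at the largest gap among the first `k_max` neighbor distances) and all function names are hypothetical stand-ins, and the score is a plain average distance to the chosen neighbors.

```python
import numpy as np


def sorted_neighbor_distances(X):
    """Row i: Euclidean distances from sample i to every other sample, ascending."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)          # push each self-distance to the end
    return np.sort(D, axis=1)[:, :-1]    # drop the trailing inf column


def fixed_k_scores(X, k):
    """Baseline: mean distance to the k nearest neighbors, same k for all samples."""
    return sorted_neighbor_distances(X)[:, :k].mean(axis=1)


def per_sample_k(X, k_max):
    """Hypothetical adaptive rule (an assumption, not the thesis's rule):
    for each sample, pick k at the largest jump between consecutive
    neighbor distances among its first k_max neighbors (needs k_max >= 2)."""
    nd = sorted_neighbor_distances(X)[:, :k_max]
    gaps = np.diff(nd, axis=1)           # shape (n, k_max - 1)
    return gaps.argmax(axis=1) + 1       # values in 1 .. k_max - 1


def adaptive_scores(X, k_max):
    """Mean distance to each sample's own k nearest neighbors."""
    nd = sorted_neighbor_distances(X)
    ks = per_sample_k(X, k_max)
    return np.array([nd[i, :k].mean() for i, k in enumerate(ks)])
```

A larger score marks a sample as more outlying in both variants. Calling `fixed_k_scores(X, k)` with several values of `k` on the same data shows how the ranking can shift with k, which is the sensitivity the abstract describes; `adaptive_scores(X, k_max)` removes the single global k at the cost of one upper bound `k_max`.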
Keywords/Search Tags: Outlier, Label Noise, Personalized k-Nearest Neighbor, Degree of Outliers, Noise Factor