Research On Outlier Detection For Unbalanced Data

Posted on: 2018-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
GTID: 2348330533959825
Subject: Software engineering
Abstract/Summary:
With the development of information technology and networks, ever larger volumes of disordered data are being collected and stored, with no apparent regularity. Data mining aims to extract valuable information from such data. In recent years, outlier detection has become an important research field in data mining. Outliers differ markedly from the rest of the data, and outlier detection identifies the small number of objects or attributes that behave abnormally; valuable information or knowledge may be hidden behind these objects. Outlier detection is widely used in fields such as fraud detection, intrusion detection, and fault diagnosis. Existing outlier detection methods still have problems; for example, they do not take class imbalance into account. Because outliers are far fewer than normal objects, introducing methods for processing unbalanced data into outlier detection can make detection more effective. However, current methods for processing unbalanced data mainly target numeric data and cannot effectively handle categorical data. In practice, categorical data is common, and outliers need to be detected in it. Because categorical data lacks the geometric properties of numeric data, existing methods cannot be applied to it directly, and dedicated methods are needed for categorical unbalanced data. To address these problems, this thesis studies outlier detection for categorical unbalanced data. First, it proposes WODKM, a K-modes clustering algorithm based on a weighted overlap distance. Second, it combines WODKM with the SMOTE method to obtain HS_WODKM, a hybrid sampling algorithm for categorical unbalanced data. Third, it uses HS_WODKM together with ensemble learning to detect outliers effectively in categorical unbalanced data.
The main work of this thesis is as follows. First, the traditional K-modes clustering algorithm is improved, yielding the WODKM clustering algorithm based on a weighted overlap distance. WODKM takes into account the differing importance of attributes for clustering by assigning each attribute a weight that reflects its influence, which improves clustering quality. Experimental results show that WODKM achieves higher clustering precision than the traditional K-modes algorithm. Second, for categorical unbalanced data, the thesis proposes the hybrid sampling algorithm HS_WODKM, which addresses the imbalance by increasing the number of positive samples and reducing the number of negative samples. An improved SMOTE method oversamples the positive samples, and WODKM is used to reduce the negative samples. Using the two sampling strategies together effectively avoids the overfitting problem in classification caused by sample imbalance. Experimental results show that HS_WODKM is effective for categorical unbalanced data. Third, the thesis proposes an outlier detection method based on hybrid sampling and ensemble learning that effectively detects outliers in class-imbalanced categorical data. The method first applies HS_WODKM to perform hybrid sampling on the unbalanced data set, yielding a balanced data set, and then applies an ensemble learning algorithm to the preprocessed data set to detect outliers. Experimental results show that the proposed method achieves better outlier detection performance.
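The abstract does not give the WODKM distance formula or how its attribute weights are derived, so the following is only a minimal sketch of the general idea of a weighted overlap distance in K-modes clustering. The function names, the example records, and the weight values are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def weighted_overlap_distance(x, y, weights):
    """Sum the weights of the attributes on which two categorical records differ.

    With all weights equal to 1 this reduces to the plain overlap
    (simple matching) distance used by standard K-modes.
    """
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def nearest_mode(record, modes, weights):
    """Return the index of the cluster mode closest to the record."""
    return min(
        range(len(modes)),
        key=lambda k: weighted_overlap_distance(record, modes[k], weights),
    )


def column_modes(records):
    """Return the most frequent category of each attribute (a K-modes centre)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


# Hypothetical categorical records with three attributes.
records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
]
# Assumed attribute weights; the thesis's actual weighting scheme is not stated.
weights = [0.5, 0.3, 0.2]
modes = [("red", "small", "round"), ("blue", "large", "round")]

assignments = [nearest_mode(r, modes, weights) for r in records]
```

In this sketch a mismatch on a heavily weighted attribute moves a record further from a cluster centre than a mismatch on a lightly weighted one, which is the effect the abstract attributes to giving attributes different weights.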
Keywords/Search Tags:outliers, unbalanced data, K-modes, clustering, SMOTE, over sampling technique, hybrid sampling, ensemble learning