Research On Outlier Detection For Categorical Data

Posted on:2021-04-06

Degree:Master

Type:Thesis

Country:China

Candidate:Z P Sun

GTID:2428330611499745

Subject:Computer technology

Abstract/Summary:

In data analysis, some data are often inconsistent with the pattern of the data set as a whole. Such data are called abnormal data or outliers. Outlier detection is one of the most fundamental data analysis tasks, used to detect rare events, exceptions, or deviations from regular objects. It has important applications in many areas, including information communications, statistics, financial fraud, network security, and climate anomalies. Although there are many methods for outlier detection in numerical data, only a few methods can handle categorical data, even though categorical data are widely used in everyday life. Because categorical data are usually unordered and discrete, it is difficult to define similarity, find nearest neighbors, or calculate distances and densities over them. These difficulties make categorical outlier detection a challenging task.

This thesis studies the problem of categorical outlier detection and proposes two entropy-based methods: ODF (Outlier Detection Forest) and FAST-ODT (Fast Outlier Detection Tree), an improvement on ODF. First, entropy difference is introduced as a measure of how anomalous data are, based on the change in data set entropy. In traditional entropy difference methods, the value of the entropy difference depends on the scale of the data. To solve this problem, an improved entropy difference calculation is proposed. Next, the thesis proposes a tree-based ODT (Outlier Detection Tree) algorithm, which is a component of the ODF algorithm. The ODT algorithm determines the degree of abnormality of the attributes in the data from the value of the entropy difference, and then classifies the data according to the if-else rules of the tree model. The ODF algorithm constructs different ODT models by applying a pruning method to the tree model, and these ODT models are combined to form an ODF model. The ODF algorithm performs a final classification of the data points based on a combination of the anomaly detection results of the ODTs. Based on ODF, an improved algorithm named FAST-ODT is proposed.

In addition, the proposed algorithms are implemented and compared with state-of-the-art methods in simulation experiments to evaluate their performance. For categorical outlier detection, the ODF algorithm runs faster than the other algorithms while preserving outlier detection accuracy. The FAST-ODT algorithm further improves on ODF: it preserves the quality of outlier detection and has significant advantages in running time over state-of-the-art algorithms.
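The entropy difference idea mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the plain, unimproved form: the entropy of a categorical data set is taken as the sum of the Shannon entropies of its attributes, and a record's score is the drop in that entropy when the record is removed. The abstract does not give the thesis's improved, scale-independent formula or the ODT/ODF construction, so neither is reproduced here; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def dataset_entropy(rows):
    """Sum of per-attribute entropies (assumption: attributes treated independently)."""
    if not rows:
        return 0.0
    return sum(attribute_entropy(column) for column in zip(*rows))


def entropy_differences(rows):
    """For each record: entropy of the full data set minus entropy without that record.

    A large positive value means that removing the record makes the data
    noticeably more regular, which marks the record as a candidate outlier.
    """
    total = dataset_entropy(rows)
    return [
        total - dataset_entropy(rows[:i] + rows[i + 1:])
        for i in range(len(rows))
    ]


# Hypothetical toy data: four identical records and one deviating record.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),
]

scores = entropy_differences(rows)
most_anomalous = max(range(len(rows)), key=scores.__getitem__)
print(most_anomalous, [round(s, 4) for s in scores])
```

In this toy example, removing the deviating record lowers the data set entropy, while removing any of the regular records raises it slightly, so the deviating record receives the highest score. These raw scores change with the number of records, which is the scale dependence the abstract says the improved calculation is meant to remove.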

Keywords/Search Tags:

outlier detection, categorical data, entropy difference

Related items

1	Research On Outliers Detection Algorithm Based On Categorical And Numerical Big Data
2	Research On Outlier Detection Based On Density Difference
3	Researches On Outlier Detection Algorithms For Categorical Matrix-object Data
4	The Research On Clustering Algorithm For Categorical Data Based-on Rough Set
5	Outlier Detection And Application Of Categorical Data In Spark Cluster
6	Research And Application Of Outlier Detection Algorithm
7	Research On Outlier Detection Algorithm For High-Dimensional Data Based On Angle And Entropy
8	Research On Subspace Clustering Algorithm For Categorical Data
9	Outlier Detection Based On Distance And Information Entropy Uncertainty
10	Research On Outlier Mining Method Oriented To Multidimensional Data