Research On Outlier Detection For Categorical Data

Posted on: 2021-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Sun
GTID: 2428330611499745
Subject: Computer technology
Abstract/Summary:
In data analysis, data that are inconsistent with the pattern of the rest of the data set often appear. Such data are called abnormal data or outliers. Outlier detection is one of the most fundamental data analysis tasks for detecting rare events, exceptions, or deviations from regular objects. It has important applications in many areas, including information communications, statistics, financial fraud, network security, and climate anomalies. Although there are many methods for outlier detection in numerical data, only a few methods can handle categorical data, even though categorical data are widely used in practice. Because categorical data are usually unordered and discrete, it is difficult to define similarity, find nearest neighbors, or calculate distances and densities over them. These problems make categorical outlier detection a challenging task.

This thesis studies the problem of categorical outlier detection. Two entropy-based categorical outlier detection methods are proposed: ODF (Outlier Detection Forest) and FAST-ODT (Fast Outlier Detection Tree), an improvement on ODF. First, a measure called entropy difference is introduced, which quantifies how anomalous data are by the change they cause in the entropy of the data set. In traditional entropy difference methods, the value of the entropy difference depends on the scale of the data. To solve this problem, an improved entropy difference calculation is proposed. Next, the thesis proposes a tree-based ODT (Outlier Detection Tree) algorithm, which is a component of the ODF algorithm. The ODT algorithm determines the degree of abnormality of the attributes in the data from the value of the entropy difference, and then classifies the data according to the if-else rules of the tree model. The ODF algorithm constructs different ODT models by applying a pruning method to the tree model, and these ODT models are combined to form an ODF model. The ODF algorithm makes a final classification of the data points based on a combination of the anomaly detection results of the ODTs. Based on ODF, an improved algorithm named FAST-ODT is proposed.

In addition, the proposed algorithms are implemented and compared with state-of-the-art methods in simulation experiments to evaluate their performance. For categorical outlier detection, the ODF algorithm runs faster than the other algorithms while maintaining outlier detection accuracy. The FAST-ODT algorithm further improves on ODF: it maintains the quality of outlier detection and has a significant advantage in running time over state-of-the-art algorithms.
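As an illustration of the entropy-difference idea described in the abstract, the following is a minimal sketch in Python. It assumes the basic formulation: the Shannon entropy of one categorical attribute, minus the entropy of the same attribute after a single data point is removed. The function names `entropy` and `entropy_difference` and the toy `colors` column are hypothetical, and the sketch does not reproduce the improved entropy difference, ODT, ODF, or FAST-ODT, whose details the abstract does not give.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def entropy_difference(values, index):
    """Entropy of the column minus its entropy after removing values[index].

    Assumption for illustration only: a larger positive value means that
    removing the point makes the column more uniform, so the point looks
    more anomalous.
    """
    rest = values[:index] + values[index + 1:]
    return entropy(values) - entropy(rest)


# Hypothetical toy attribute: eight common values and one rare value.
colors = ["red"] * 8 + ["blue"]
diffs = [entropy_difference(colors, i) for i in range(len(colors))]

# Index of the point whose removal lowers the entropy the most.
most_anomalous = max(range(len(colors)), key=lambda i: diffs[i])
```

In this toy column the single "blue" entry is the point whose removal lowers the entropy most, so it receives the largest entropy difference. Repeating the experiment with many more "red" entries gives a smaller raw value for the same rare point, which is one way to see the scale dependence that the abstract says the improved calculation is meant to address.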
Keywords/Search Tags:outlier detection, categorical data, entropy difference