Unsupervised Outlier Detection Techniques For Complex Data

Posted on: 2020-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Xu
Full Text: PDF
GTID: 2518306548495834
Subject: Cyberspace security
Abstract/Summary:
Outlier detection is the process of identifying rare and exceptional data objects. In the era of big data, outlier detection has wide applications in many real-world scenarios, e.g., cyber security, finance, and biomedicine. With the explosion of data, data is also becoming highly complex in multiple dimensions, posing serious challenges to outlier detection technologies. First, obtaining high-quality data labels is expensive in many practical situations, so unsupervised outlier detection methods are more popular. However, capturing outlying behaviors in an unsupervised manner is a challenging problem. Second, most existing outlier detection methods are designed for data with only numerical features and may have many potential limitations with respect to, e.g., the quality of the data, the underlying characteristics of the data, and the form of the outliers. The complexity of data considerably degrades the performance of outlier detection in many real-world applications. Therefore, this thesis aims to propose unsupervised outlier detection techniques for complex data. This thesis makes the following major contributions:

(1) This thesis proposes a novel noise-resilient outlier detection method for categorical data. Noise is unavoidable in real-world data. It is difficult to distinguish outliers from noise because both are rare. In particular, there is no reliable solution for separating noisy feature values from outlying feature values. Feature subspace-based methods are the main solution for noisy data. However, because a categorical feature may contain both outlying values and noisy values, these methods inevitably mix in noisy values when they retain an entire feature, and they obtain suboptimal results on some data. This thesis introduces OUVAS, an unsupervised framework for selecting high-quality outlying feature values, and its instantiated algorithm RHAC. The proposed method employs the relationships between feature values to differentiate outlying values from noisy values, and proposes a novel idea for searching for the relevant data subspace in noisy categorical data. The high-quality value subset can be used directly for outlier detection (RHAC-OD). RHAC-OD achieves 10%-19% AUC improvements on real-world datasets with different noise levels compared to state-of-the-art outlier detectors. In addition, the feature value subset explored by RHAC can be used to perform feature selection (RHAC-FS). RHAC-FS-empowered outlier detection algorithms obtain 5%-15% improvements over their bare versions.

(2) This thesis proposes a novel outlier detection method for Non-IID categorical data. The Non-IIDness of real-world data also greatly influences the performance of outlier detection algorithms that are based on the IID assumption. Mining and utilizing multiple couplings in Non-IID categorical data is a popular approach to handling such data. However, existing methods can only model pairwise primary value couplings and cannot discover the real relationships hidden in high-order complex value couplings. Therefore, this thesis proposes an embedding-based complex feature value coupling learning framework for outlier detection (termed EMAC) and its instantiated algorithm SCAN. This thesis proposes a biased value coupling-based network embedding and a bidirectional partial outlier propagation model, which address the problem of learning, representing, and utilizing high-order complex feature value couplings for outlier detection in Non-IID categorical data. SCAN obtains 8%-13% AUC improvements on Non-IID real-world data compared to state-of-the-art outlier detection algorithms.

(3) This thesis proposes a novel method for mixed-type data with heterogeneous outliers. Mixed-type data, which contains both categorical and numerical features, is pervasive in many real-world applications. Handling such data is challenging because of the differences between categorical and numerical features. Many existing methods evaluate outlierness separately and independently in the numerical and categorical feature spaces. However, they fail to adequately consider the behaviors of data objects in the different feature spaces during their outlier scoring phases, often leading to suboptimal results. In addition, many outlier detection methods are inherently restricted by their outlier definitions and cannot simultaneously detect both clustered outliers and scattered outliers. This thesis proposes a joint learning-based framework JLOD and its instantiated algorithm MIX for outlier detection in mixed-type data with heterogeneous outliers. The proposed method uses the ensemble score and inlier candidates as prior knowledge to drive the next outlier scoring phase, so the algorithm can iteratively and jointly perform outlier scoring in the numerical and categorical spaces. To detect both clustered and scattered outliers, the proposed outlier scoring phases capture the essential characteristic of outliers by evaluating outlierness as the deviation from the normal model generated by these inlier candidates. MIX achieves 8%-36% AUC improvements over its state-of-the-art competitors on real-world mixed-type data.
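
The iterative idea described in contribution (3) can be illustrated with a small sketch. This is not the thesis's MIX algorithm, whose details are not given in the abstract. It is a toy under stated assumptions: numerical outlierness is taken as the mean absolute z-score against the inlier candidates, categorical outlierness as the mean rarity of an object's values among the inlier candidates, and the two are combined by a plain average. The function names and the `keep` and `rounds` parameters are hypothetical. An `auc` helper shows how the AUC figures quoted in the abstract are typically computed from scores and ground-truth labels.

```python
from collections import Counter
from statistics import mean, pstdev


def minmax(xs):
    """Rescale scores to [0, 1] so the two feature spaces are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def score_numeric(rows, inliers):
    """Mean absolute z-score, with mean/std estimated from inlier candidates only."""
    stats = []
    for j in range(len(rows[0])):
        col = [rows[i][j] for i in inliers]
        stats.append((mean(col), pstdev(col) or 1.0))  # guard against zero spread
    return [mean([abs(r[j] - mu) / sd for j, (mu, sd) in enumerate(stats)])
            for r in rows]


def score_categorical(rows, inliers):
    """Mean rarity: 1 - relative frequency of each value among inlier candidates."""
    n_feats = len(rows[0])
    freqs = [Counter(rows[i][j] for i in inliers) for j in range(n_feats)]
    m = len(inliers)
    return [mean([1 - freqs[j][r[j]] / m for j in range(n_feats)]) for r in rows]


def joint_scores(num_rows, cat_rows, keep=0.7, rounds=3):
    """Alternate between scoring and re-selecting inlier candidates (assumed scheme)."""
    n = len(num_rows)
    inliers = list(range(n))  # start by treating every object as an inlier candidate
    combined = [0.0] * n
    for _ in range(rounds):
        s_num = minmax(score_numeric(num_rows, inliers))
        s_cat = minmax(score_categorical(cat_rows, inliers))
        combined = [(a + b) / 2 for a, b in zip(s_num, s_cat)]
        order = sorted(range(n), key=combined.__getitem__)
        inliers = order[:max(2, round(keep * n))]  # lowest scores drive the next round
    return combined


def auc(scores, labels):
    """Probability that a random outlier (label 1) outscores a random inlier (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy mixed-type data: seven ordinary objects, then two clustered outliers
# (around 5.0, category "b") and one scattered outlier (-3.0, category "c").
num_rows = [[1.0], [1.1], [0.9], [1.0], [1.2], [0.8], [1.05], [5.0], [5.1], [-3.0]]
cat_rows = [["a"]] * 7 + [["b"], ["b"], ["c"]]
labels = [0] * 7 + [1, 1, 1]

scores = joint_scores(num_rows, cat_rows)
print("AUC on toy data:", auc(scores, labels))
```

Because the normal model is re-estimated from the inlier candidates in each round, both the two clustered objects and the single scattered object are measured against the same reference rather than against their own neighborhood, which is the intuition the abstract gives for handling both outlier forms.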
Keywords/Search Tags:Outlier Detection, Noisy Data, Non-IID Data, Mixed-Type Data, Coupling Learning, Joint Learning