
Effective Methods To Deal With Outlier Detection Problems In Static And Streaming Data

Posted on: 2021-01-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate:
Full Text: PDF
GTID: 1368330614450994
Subject: Computer Science and Technology
Abstract/Summary:
The accessibility, expediency, and trustworthiness of data are indispensable in contemporary society, and clean data of any kind has become a new human treasure. In many fields, the ability to maintain high-quality data has become paramount, even though the volume and speed of the data make it more challenging. Data empower many industries and businesses to realize their potential by offering valuable insights into their activities and an advantage over their rivals, so companies now invest in data mining skills to extract that insight. Data mining discovers hidden patterns in many kinds of data. Detecting outliers is one such important data mining task: it aims to identify objects that deviate from the expected pattern of the normal data, because outliers can dramatically influence the outcome of data analysis.

Outlier detection is an essential problem that has been studied in a wide range of applications, across diverse fields and data types. Outliers have many potential sources, and identifying them in large datasets requires methods that are both effective and efficient. The task has become more challenging in the digital age. For instance, in contrast to traditional batch data, large volumes of data are now generated continuously, dynamically, and at high speed. Such data may contain redundant information, which often degrades the efficiency and overall performance of an outlier detection method. Many methods and techniques, built on different methodologies and algorithms, have been proposed over the years to address these challenges. Commonly encountered difficulties relate to the nature of the input data, the outlier type, the data labels, the accuracy, and the computational complexity in terms of CPU time and memory consumption. Researchers continue to seek better solutions to these challenges, together with the problems associated with detecting outliers efficiently.

Toward this pursuit, and because most traditional methods have setbacks and limitations, this dissertation addresses some of the challenges of detecting outliers in different datasets and proposes effective methodologies to handle outliers in both static and streaming data. It conducts extensive experiments to evaluate the performance of the proposed techniques against previous methods and discusses the salient findings.

This dissertation contains five chapters, with the first two chapters serving as the basis of the work. The first chapter introduces the motivation and objectives of our study, along with foundational concepts of outlier detection, including its definitions, causes, and application areas. The second chapter presents a comprehensive and organized review of the progress of outlier detection methods over the past two decades. We group these methods into categories, such as distance-based and clustering-based approaches. In each category, we introduce some state-of-the-art outlier detection methods and discuss their performance in detail. Furthermore, we delineate their pros, cons, and challenges to give researchers a concise overview of each technique, and we recommend solutions and possible research directions.

In the third part of this dissertation, among the different categories of outlier detection methods, we propose a statistical-based approach to the problem of detecting outliers. We offer solutions designed to detect outliers more effectively, with a high detection rate and minimal computational cost. To achieve this, we propose the Gaussian Mixture Model for Outlier Detection (GMMOD) algorithm for the parametric approach and the Kernel Density Estimation for Outlier Detection (KDEOD) algorithm for the non-parametric approach.

The fourth and fifth parts extend the goal of detecting outliers to a different data type: data streams. In the fourth part, we present a distance-based method called Micro-Cluster with Minimal Probing (MCMP), which is a hybrid approach. It offers a new distance-based outlier detection technique that minimizes the computational cost of detecting distance-based outliers effectively. The proposed MCMP technique comprises two components. First, it adopts micro-clusters to reduce the number of range query searches. Then, to deal with the objects outside the micro-clusters, it introduces the concept of differentiating between strong and trivial inliers.

In the fifth part, we present a clustering-based method called CLustering for Outlier detection in Data Streams (CLODS). CLODS detects outliers in evolving data streams by first applying a micro-clustering technique to cluster dense data points, and then handling the data points within a window according to the relevance of their status to their respective neighbors or position.

Both proposed streaming methods improve computational speed and memory consumption while maintaining outlier detection accuracy. They outperform the state-of-the-art methods in both CPU time and memory consumption on the majority of the datasets. Finally, in the last part, we present some open research issues and challenges that will provide researchers with a clear path for the future of outlier detection methods.
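
As a rough illustration of the distance-based idea mentioned in the abstract, the sketch below flags a point as an outlier when it has fewer than `k` neighbors within distance `radius`, and uses a commonly used micro-cluster shortcut to skip range queries for points in dense regions. It is not the dissertation's MCMP algorithm: the function names, the greedy cluster construction, the `radius / 2` rule, and the sample data are all assumptions made for this example, and the strong/trivial inlier distinction and the sliding window are not modeled.

```python
from math import dist


def find_micro_cluster_members(points, radius, k):
    """Return indices of points covered by a dense micro-cluster (assumed rule).

    A micro-cluster here is a group of more than k points lying within
    radius / 2 of a centre point. By the triangle inequality any two members
    are at most `radius` apart, so each member has at least k neighbours and
    is an inlier without any further range query.
    """
    covered = set()
    for c_idx, centre in enumerate(points):
        if c_idx in covered:
            continue
        members = [
            i for i, p in enumerate(points)
            if i not in covered and dist(centre, p) <= radius / 2
        ]
        if len(members) > k:
            covered.update(members)
    return covered


def detect_outliers(points, radius, k):
    """Return indices of points with fewer than k neighbours within radius."""
    safe = find_micro_cluster_members(points, radius, k)
    outliers = []
    for i, p in enumerate(points):
        if i in safe:
            continue  # inlier by micro-cluster membership, no range query
        neighbours = 0
        for j, q in enumerate(points):
            if j != i and dist(p, q) <= radius:
                neighbours += 1
                if neighbours >= k:
                    break  # enough neighbours found, stop probing early
        if neighbours < k:
            outliers.append(i)
    return outliers


# Hypothetical sample: four tightly grouped points and one isolated point.
sample_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
sample_outliers = detect_outliers(sample_points, radius=1.0, k=2)
```

In this sketch the four grouped points are cleared by the micro-cluster check alone, so only the isolated point triggers a full neighbor scan. A streaming version would additionally need to expire old points from the window and update the clusters as points arrive.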
Keywords/Search Tags:Outlier detection, Data Streams, Statistical-based, Distance-based, Clustering-based