
Research On Parallelization Of Isolation Forest Algorithm Based On Spark

Posted on: 2020-03-07  Degree: Master  Type: Thesis
Country: China  Candidate: G Liu  Full Text: PDF
GTID: 2428330590450664  Subject: Software engineering
Abstract/Summary:
Anomaly detection is an important research direction in data mining and has a wide range of applications in real business scenarios. Traditional statistical and probabilistic models [1], linear models [2], and similarity-based anomaly detection methods [3] mostly build a model of normal instances, so normal instances that do not conform to the model's distribution are misidentified as anomalous. This produces a flooding (swamping) effect. With the rapid development of the Internet, more and more devices are connected to it, and the volume of generated data is growing explosively. Detecting anomalies in massive data is therefore a very challenging problem. Most traditional methods have high computational complexity and can only be applied to low-dimensional data of small volume [4].

Isolation Forest (IForest) [4] is based on the idea that "abnormal points are isolated": it recursively partitions the data space to construct the trees of the forest, and abnormal points are reached more quickly than normal points in those trees. IForest is an unsupervised, non-parametric algorithm, so it does not depend on labeled data, which are rare. In addition, IForest is a fast ensemble algorithm with linear time complexity and high precision [5]. Apache Spark is a distributed computing framework based on resilient distributed datasets (RDDs) [6], and its higher-level machine learning component, ML, provides the foundation for implementing parallel algorithms.

This thesis implements a parallel IForest algorithm library (Spark-IForest) on Spark. In the training phase, the construction of each tree in IForest is independent of the others; by sampling from the training data and maintaining global split information, the IForestModel is built in parallel. In the prediction phase, instances are independent of one another, so the prediction of abnormal instances can also be done in parallel. To verify that Spark-IForest improves the speed of anomaly prediction while maintaining satisfactory accuracy, this thesis conducts experiments and analyzes the results. The performance and scalability tests show that Spark-IForest achieves satisfactory AUC and that, in the multi-core parallel scenario, it is much faster than stand-alone Spark-IForest and Sklearn-IForest. In addition, under certain conditions, the performance of Spark-IForest increases as the degree of parallelism increases. Spark-IForest therefore makes fast anomaly detection feasible in massive-data scenarios.
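The two properties the abstract relies on (each tree is built independently, and each instance is scored independently) can be illustrated with a minimal single-machine sketch. This is not the thesis's Spark-IForest code: the function names (`build_tree`, `fit_forest`, `anomaly_score`) are hypothetical, a Python thread pool merely stands in for Spark executors to show that tree construction needs no shared state, and the score follows the standard Isolation Forest definition s(x) = 2^(-E[h(x)] / c(n)).

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # all values equal on this feature: stop splitting
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def fit_forest(data, n_trees=100, sample_size=256, workers=4, seed=0):
    """Build every tree independently; the pool stands in for Spark executors."""
    psi = min(sample_size, len(data))
    if psi < 2:
        raise ValueError("need at least two training points")
    max_depth = math.ceil(math.log2(psi))

    def train(tree_seed):
        # Each tree has its own RNG and subsample, so no state is shared.
        rng = random.Random(tree_seed)
        sample = rng.sample(data, psi)
        return build_tree(sample, 0, max_depth, rng)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        trees = list(pool.map(train, range(seed, seed + n_trees)))
    return trees, psi


def path_length(x, node):
    """Depth at which x lands, plus the expected depth of the unbuilt subtree."""
    depth = 0
    while "size" not in node:
        node = node["left"] if x[node["dim"]] < node["split"] else node["right"]
        depth += 1
    return depth + c(node["size"])


def anomaly_score(forest, x):
    """Score in (0, 1]; higher means x is isolated after fewer splits."""
    trees, psi = forest
    mean_path = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c(psi))
```

Because `anomaly_score` reads the forest without modifying it, scoring a batch of instances is an embarrassingly parallel map over the data, which is the property the prediction phase exploits. In CPython the thread pool gives little real speedup for this CPU-bound work; it is used here only to make the independence of the trees explicit.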
Keywords/Search Tags: Isolation Forest, Apache Spark, Anomaly Detection, Parallel Computing