Research On Parallelization Of Isolation Forest Algorithm Based On Spark

Posted on:2020-03-07

Degree:Master

Type:Thesis

Country:China

Candidate:G Liu

GTID:2428330590450664

Subject:Software engineering

Abstract/Summary:

Anomaly detection is an important research direction in data mining, with a wide range of applications in real business scenarios. Traditional statistical and probabilistic models [1], linear models [2], and similarity-based anomaly detection methods [3] mostly build a model of normal instances, and instances that do not conform to the model's distribution are identified as anomalies. As a result, these methods suffer from a swamping effect, in which normal instances are misidentified as anomalies. With the rapid development of the Internet, more and more devices are connected to it, and the volume of generated data is growing explosively. Detecting anomalies in massive data is therefore a very challenging problem. Most traditional methods have high computational complexity and can only be applied to low-dimensional data of small volume [4].

Isolation Forest (IForest) [4] is based on the idea that anomalous points are easy to isolate: it recursively partitions the data space to construct the trees of the forest, and anomalous points are reached more easily in the forest than normal points. IForest is an unsupervised, non-parametric algorithm, so it avoids the problem of scarce labeled data. In addition, IForest is a fast ensemble algorithm with linear time complexity and high precision [5].

Apache Spark is a distributed computing framework based on resilient distributed datasets (RDDs) [6], and its higher-level machine learning component, ML, provides the basis for implementing parallel algorithms. This thesis implements a parallel IForest algorithm library, Spark-IForest, on top of Spark. In the training phase, the construction of each tree in IForest is independent of the others, so the IForestModel is built in parallel by sampling from the training data while maintaining global split information. In the prediction phase, instances are independent of one another, so anomaly prediction can also be carried out in parallel.

To verify that Spark-IForest speeds up anomaly prediction while keeping satisfactory accuracy, this thesis conducts experiments and analyzes the results. The performance and scalability tests show that Spark-IForest achieves satisfactory AUC and that, in the multi-core parallel scenario, it is considerably faster than both stand-alone Spark-IForest and Sklearn-IForest. In addition, the performance of Spark-IForest improves as parallelism increases, within certain limits. Spark-IForest therefore makes fast anomaly detection feasible in massive-data scenarios.
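The per-tree independence in training and the per-instance independence in prediction can be illustrated with a minimal, single-machine sketch. This is not the thesis's Spark-IForest code: a Python thread pool stands in for Spark tasks only to show which units of work are independent (it does not reproduce any speedup), and all names (`fit_forest`, `score`, `score_all`) are hypothetical. The defaults of 100 trees and a subsample of 256 points, and the scoring formula, follow the original Isolation Forest paper rather than anything stated in the abstract.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # common convention for the two-point case
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively cut the sample with random axis-parallel splits."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on this feature; stop here
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, x, depth=0):
    """Depth at which x lands, plus an adjustment for unsplit leaf points."""
    if "size" in tree:
        return depth + c(tree["size"])
    child = tree["left"] if x[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, x, depth + 1)

def fit_tree(task):
    """Build one tree from its own subsample; needs nothing from other trees."""
    data, sample_size, seed = task
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    max_depth = math.ceil(math.log2(max(len(sample), 2)))
    return build_tree(sample, 0, max_depth, rng)

def fit_forest(data, n_trees=100, sample_size=256, seed=0, workers=4):
    """Training: one independent task per tree (thread pool as a Spark stand-in)."""
    tasks = [(data, sample_size, seed + i) for i in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        forest = list(pool.map(fit_tree, tasks))
    return forest, min(sample_size, len(data))

def score(forest, psi, x):
    """Anomaly score 2^(-E[h(x)] / c(psi)); values near 1 suggest an anomaly."""
    mean_depth = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-mean_depth / c(psi))

def score_all(forest, psi, xs, workers=4):
    """Prediction: one independent task per instance."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda x: score(forest, psi, x), xs))

# Usage: a Gaussian cluster plus one far-away point (assumes at least 2 points).
rng = random.Random(42)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)] + [(10.0, 10.0)]
forest, psi = fit_forest(data)
print(score_all(forest, psi, [(0.0, 0.0), (10.0, 10.0)]))
```

The far-away point is expected to receive the higher score because it is separated from the cluster after fewer random cuts. In a Spark setting the same two loops would presumably map onto distributed tasks, which is the structure the abstract describes.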

Keywords/Search Tags:

Isolation Forest, Apache Spark, Anomaly Detection, Parallel Computing

Related items

1	Research On Online Anomaly Detection Method Of Network Data Stream Based On Isolation Forest
2	Research And Implementation Of Network Traffic Anomaly Detection Based On Spark Platform
3	Application Research Of Outlier Anomaly Detection Technology For Time Series Data
4	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
5	The Research Of Real-time Network Traffic Anomaly Detection Based On Spark Technology
6	Research On Anomaly Detection Based On Ensemble Learning Algorithms
7	Research Of Anomaly Detection Method Based On Hash Mapping And Isolation Principle
8	Anomaly Detection In Application Delivery Networks Based On Isolated Forest And Improved X-means
9	Application Of Random Forest In Cloud Computing Anomaly Detection
10	Design And Implementation Of Anomaly Detection System For Network Freight Transport Documents Based On Machine Learning