Research On IForest Algorithm Optimization And Its Parallelization Based On LSH And Information Entropy

Posted on: 2021-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: B W Hou
GTID: 2518306050471234
Subject: Master of Engineering
Abstract/Summary:
With the rapid development of computer technology, massive volumes of data are generated continuously and grow exponentially. Mining valuable information from such data has become a major research focus. Traditional data mining, however, is severely constrained when faced with very large data volumes. Distributed computing frameworks such as Hadoop and Spark have become the main way to address this problem: running on a cluster improves the efficiency of traditional data mining techniques to a certain extent and scales well, which matters greatly for research on the information contained in large amounts of data.

Anomaly detection generally refers to identifying abnormal values that deviate from normal data, for example in operation and maintenance, and it is a common application of machine learning. This thesis focuses on the isolation forest (IForest) algorithm, one of the common anomaly detection algorithms, and on methods for optimizing it. It also designs and implements a parallel version on the Spark big data platform, thereby greatly improving the algorithm's performance.

The thesis first analyzes the basic principles of the traditional IForest algorithm, the Spark framework, and Spark's parallel execution mechanism. The IForest algorithm has low detection accuracy and poor execution efficiency, is sensitive to globally sparse points, and handles locally sparse points poorly. To address these problems, and building on existing optimization strategies, the thesis proposes two optimizations: a data preprocessing method based on the spatial distribution produced by locality-sensitive hashing (LSH), and a data splitting method based on per-dimension entropy. The main research results are as follows:

(1) As data is generated continuously, traditional outlier detection techniques often fail on large amounts of high-dimensional data. Inspired by the low storage cost and high query efficiency of hashing methods in high-dimensional space, a preprocessing method based on the LSH spatial distribution is proposed for the sample data of the isolation forest algorithm. LSH groups the most similar samples into the same bucket, and the similar samples in each bucket are replaced with a single weighted point. On the one hand, this greatly reduces the number of samples and makes the subsequent algorithm more efficient. On the other hand, it allows the IForest algorithm to process higher-dimensional data, which broadens its range of application and improves its accuracy. The experimental results of the proposed optimization are compared against existing results on UCI machine learning data sets. They verify that IForest anomaly detection with LSH-based preprocessing is substantially more efficient and more accurate.

(2) Although the IForest algorithm has low time complexity and detects anomalies well, it is not stable enough and is not robust to noisy features. In 1948, Shannon proposed the concept of "information entropy" to measure the amount of information. The more ordered a system is, the lower its information entropy; the more disordered it is, the higher its information entropy. When the IForest algorithm constructs a tree, it randomly picks one attribute of the data set at each node and then obtains a cut value, generally at random or by taking a middle value, to split the node into child nodes. This thesis introduces entropy into the choice of the cut value, which makes it easier to separate outliers from the normal clusters early. Information entropy reflects how each attribute is distributed in the sample data. The more unevenly an attribute is distributed, the more likely a split on it is to isolate abnormal samples, which makes the algorithm more efficient than the purely random selection described above.

In addition to optimizing the IForest algorithm itself, this thesis addresses the difficulty of mining large amounts of data efficiently by designing and implementing a parallel version of the algorithm on the Spark platform. Finally, the approach is verified experimentally on UCI data sets. The results show that the optimized algorithm greatly improves both efficiency and accuracy.
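The LSH preprocessing step in result (1) can be illustrated with a minimal Python sketch. The abstract does not say which hash family the thesis uses, so this example assumes random-hyperplane (sign) hashing; the function name `lsh_compress` and the parameters `n_planes` and `seed` are hypothetical. Each bucket is replaced by its centroid, with the bucket size as the weight.

```python
import random
from collections import defaultdict


def lsh_compress(points, n_planes=4, seed=0):
    """Bucket points with random-hyperplane hashing (an assumed hash family)
    and return one (centroid, weight) pair per non-empty bucket."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Each hyperplane passes through the origin and is defined by a Gaussian normal vector.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

    buckets = defaultdict(list)
    for p in points:
        # The hash key records on which side of every hyperplane the point lies.
        key = tuple(sum(w * x for w, x in zip(plane, p)) >= 0.0 for plane in planes)
        buckets[key].append(p)

    compressed = []
    for members in buckets.values():
        weight = len(members)  # how many original samples this point represents
        centroid = [sum(col) / weight for col in zip(*members)]
        compressed.append((centroid, weight))
    return compressed


if __name__ == "__main__":
    data = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
    for centroid, weight in lsh_compress(data):
        print(centroid, weight)
```

The compressed list is what a downstream isolation forest would sample from in this sketch. How the thesis actually uses the weights when building trees is not stated in the abstract.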
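The role of information entropy in result (2) can be sketched in the same way. The abstract does not give the thesis's exact splitting rule, so the code below only shows how a per-attribute Shannon entropy could be estimated from a histogram, and, purely as an assumption, picks the attribute with the lowest entropy as the most unevenly distributed one. The names `histogram_entropy` and `most_uneven_attribute` and the bin count `n_bins` are hypothetical.

```python
import math


def histogram_entropy(values, n_bins=8):
    """Shannon entropy (in bits) of an equal-width histogram of the values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant attribute carries no information
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def most_uneven_attribute(rows, n_bins=8):
    """Return (index, entropies): the attribute with the lowest histogram entropy.

    Assumption: low entropy is taken to mean an uneven distribution and hence
    a promising attribute to split on; the thesis may use a different rule.
    """
    columns = list(zip(*rows))
    entropies = [histogram_entropy(col, n_bins) for col in columns]
    best = min(range(len(columns)), key=entropies.__getitem__)
    return best, entropies


if __name__ == "__main__":
    # Attribute 0 is spread evenly; attribute 1 is constant except for one extreme value.
    rows = [(float(i), 0.0) for i in range(7)] + [(7.0, 100.0)]
    print(most_uneven_attribute(rows))
```

In this toy data the evenly spread attribute has the maximum entropy for eight bins, while the attribute containing a single extreme value has much lower entropy and is the one selected.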
Keywords/Search Tags: IForest Algorithm, Spark, Parallel Computing, LSH, Entropy