
Isolation Forest Algorithm Based on Qualitative Data Clustering

Posted on: 2022-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: M H Chen
Full Text: PDF
GTID: 2518306539981429
Subject: Software engineering
Abstract/Summary:
With the advent of the big data era, data can be collected ever more efficiently. Identifying outliers, that is, samples that differ markedly from the rest, in massive data sets has become an important problem in production activities. Many anomaly detection schemes have been proposed for outlier recognition, but they have various defects, such as requiring massive data sets for training or depending heavily on parameter selection. Compared with other anomaly detection algorithms, the Isolation Forest has several advantages: low time complexity, a need for only small training sets, and few parameters to select. Its weakness is that detection results may be inaccurate, because the attributes used to divide samples during training are selected at random. To address this problem, this thesis uses rough set theory to judge the importance of different attributes and combines it with the Isolation Forest algorithm, proposing an Isolation Forest algorithm based on qualitative data clustering. The specific work is as follows:

(1) In the Isolation Forest, the attribute used to divide samples by value is selected completely at random. When an Isolation tree is constructed this way, it may ignore attributes that strongly influence the result and choose attributes with little influence, which leads to inaccurate detection. This thesis adopts the approach from qualitative data clustering of using clustering results to calculate the importance of each attribute to the information system, and selects the relatively important attributes for constructing Isolation trees. Experiments show that this method improves on the other methods it is compared with.

(2) To demonstrate the effectiveness of the method on a practical problem, it is applied to a real credit card data set to detect fraudulent transactions. In this process, the implementation of the attribute importance calculation is adapted to data sets of different magnitudes: the data set is divided into several sub-datasets, attribute importance is calculated on a sample drawn from each sub-dataset, and the overall attribute importance is obtained by integrating these results. Experiments confirm the effectiveness of the method.
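The two ideas in the abstract, importance-guided attribute selection and importance aggregated over sampled sub-datasets, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the rough-set importance formula, so `variance_importance` is a clearly labelled placeholder, and every function name and default parameter below (`aggregate_importance`, `build_forest`, `n_subsets`, `sample_size`, and so on) is hypothetical. Drawing the split attribute with probability proportional to its importance is also an assumption, since the abstract only says that relatively important attributes are selected. Path-length normalisation and the anomaly score follow the standard Isolation Forest definition.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def variance_importance(rows):
    """Placeholder importance (NOT the thesis's rough-set measure):
    per-attribute variance, normalised to sum to 1."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    var = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(d)]
    total = sum(var) or 1.0
    return [v / total for v in var]

def aggregate_importance(rows, importance_fn, n_subsets=4, sample_size=64, seed=0):
    """Split rows into sub-datasets, score a sample of each, average the scores."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    subsets = [s for s in (shuffled[i::n_subsets] for i in range(n_subsets)) if s]
    totals = [0.0] * len(rows[0])
    for subset in subsets:
        sample = rng.sample(subset, min(sample_size, len(subset)))
        totals = [t + s for t, s in zip(totals, importance_fn(sample))]
    return [t / len(subsets) for t in totals]

def build_tree(rows, weights, rng, depth, max_depth):
    """Isolation tree whose split attribute is drawn in proportion to its weight."""
    if len(rows) <= 1 or depth >= max_depth:
        return {"size": len(rows)}
    # Only attributes that still vary in this node can separate samples.
    candidates = [j for j in range(len(weights))
                  if min(r[j] for r in rows) < max(r[j] for r in rows)]
    if not candidates:
        return {"size": len(rows)}
    # Small epsilon keeps zero-weight attributes selectable as a last resort.
    attr = rng.choices(candidates, weights=[weights[j] + 1e-12 for j in candidates])[0]
    lo, hi = min(r[attr] for r in rows), max(r[attr] for r in rows)
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[attr] < split]
    right = [r for r in rows if r[attr] >= split]
    if not left or not right:
        return {"size": len(rows)}
    return {"attr": attr, "split": split,
            "left": build_tree(left, weights, rng, depth + 1, max_depth),
            "right": build_tree(right, weights, rng, depth + 1, max_depth)}

def build_forest(rows, weights, n_trees=50, sample_size=64, seed=0):
    """Build n_trees trees, each on a random subsample; returns (trees, subsample size)."""
    rng = random.Random(seed)
    psi = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(rows, psi), weights, rng, 0, max_depth)
             for _ in range(n_trees)]
    return trees, psi

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def path_length(row, node, depth=0):
    if "attr" not in node:  # leaf: add the expected depth of the unbuilt subtree
        return depth + c(node["size"])
    child = node["left"] if row[node["attr"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)

def anomaly_score(row, trees, psi):
    """Standard Isolation Forest score in (0, 1); higher means more anomalous."""
    mean_depth = sum(path_length(row, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c(psi))
```

A typical use, under these assumptions, would be to compute `weights = aggregate_importance(rows, variance_importance)`, build the forest with `trees, psi = build_forest(rows, weights)`, and rank samples by `anomaly_score(row, trees, psi)`. Replacing `variance_importance` with a rough-set based measure is where the thesis's actual contribution would plug in.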
Keywords/Search Tags: Anomaly Detection, Isolation Forest, Rough Set, Clustering Algorithm