Research On Parallel Outlier Detection Method In Heterogeneous Distributed Environment

Posted on: 2020-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z M Zhu
GTID: 2428330602953946
Subject: Engineering
Abstract/Summary:
In recent years, as society has become increasingly intelligent, data in industries such as transportation, medical care, finance, and education has grown explosively, which makes knowledge discovery in large-scale data more difficult. Outlier detection is an important part of knowledge discovery; its purpose is to identify the points in a data set that are abnormal but valuable. At present, outlier detection algorithms mainly target centralized processing environments. As the amount of data in the database increases, these algorithms can no longer guarantee efficient outlier detection, so some scholars have proposed outlier detection algorithms for distributed environments. However, these distributed algorithms are designed for homogeneous distributed environments. In practical applications, heterogeneous distributed environments are more common, because the processors participating in distributed computing differ in configuration. As a result, existing distributed outlier detection algorithms are not well suited to heterogeneous distributed environments. To address this problem, this paper proposes an outlier detection method for heterogeneous distributed environments. Its main contributions are as follows:

(1) A Grid-based Dynamic Data Partitioning method (GDDP) is proposed. The method improves on grid-based spatial partitioning: it first divides the given data set into multiple disjoint data sub-blocks according to the spatial position of the data points, and then dynamically allocates the computing tasks according to the computing power of each processor. This not only ensures that computing resources are fully utilized, but also accelerates the outlier decision process to some extent.

(2) Based on GDDP, a GDDP-based Outlier Detection Algorithm (GODA) is proposed to detect distance-based outliers in heterogeneous distributed environments. The algorithm has two phases. In the first phase, the data set in each local processor is managed by an index, and the scanning order is determined by the order of the points in the index; all local outliers are then computed in two scans. In the second phase, for each candidate outlier, the distance relationship between the candidate and its own block and adjacent blocks is used to determine which processors it must be sent to over the network. The candidates are then sent to the corresponding processors at the same time to obtain the final global outliers.

(3) In a real heterogeneous distributed environment, GODA is compared with the existing PENL and BOD algorithms on different data sets. The experimental results show that, compared with existing outlier detection algorithms for homogeneous distributed environments, GODA can effectively solve the outlier detection problem in heterogeneous distributed environments.
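
The ideas summarized above can be illustrated with a minimal single-machine sketch. It is not the thesis's GDDP or GODA implementation. The function names, the fixed cell width, and the static greedy allocation rule are all assumptions made for illustration (the abstract describes the allocation as dynamic). The outlier test assumes one common distance-based definition, in which a point is an outlier if fewer than k other points lie within radius r; the abstract does not state which definition the thesis uses.

```python
from collections import defaultdict
from math import dist, floor


def grid_partition(points, cell_width):
    """Group points into disjoint grid cells keyed by integer cell coordinates."""
    blocks = defaultdict(list)
    for p in points:
        key = tuple(floor(x / cell_width) for x in p)
        blocks[key].append(p)
    return dict(blocks)


def allocate_blocks(blocks, powers):
    """Assign blocks to processors in proportion to relative computing power.

    Hypothetical greedy rule: take blocks largest first and give each one to
    the processor whose load/power ratio would be lowest after receiving it.
    """
    loads = [0] * len(powers)
    assignment = {}
    for key in sorted(blocks, key=lambda k: (-len(blocks[k]), k)):
        size = len(blocks[key])
        target = min(range(len(powers)),
                     key=lambda i: (loads[i] + size) / powers[i])
        assignment[key] = target
        loads[target] += size
    return assignment, loads


def distance_outliers(points, r, k):
    """Brute-force check: a point is an outlier if fewer than k other points
    lie within distance r of it (assumed definition)."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points)
                        if j != i and dist(p, q) <= r)
        if neighbors < k:
            outliers.append(p)
    return outliers


points = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (5.0, 5.0), (1.5, 0.2)]
blocks = grid_partition(points, cell_width=1.0)
# Processor 0 is assumed to be three times as powerful as processor 1.
assignment, loads = allocate_blocks(blocks, powers=[3.0, 1.0])
outliers = distance_outliers(points, r=1.0, k=1)
```

In this toy run the faster processor receives the dense block, and the two isolated points are reported as outliers. A distributed version would additionally have to exchange candidates near block boundaries with the processors holding adjacent blocks, as the second phase of GODA describes.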
Keywords/Search Tags: Heterogeneous, Distributed Computing, Outlier Detection, Grid, Data Partitioning