Research On Outlier Detection Algorithm And Its Application For Large-scale Datasets

Posted on:2016-10-13

Degree:Master

Type:Thesis

Country:China

Candidate:B Jiang

Full Text:PDF

GTID:2348330542476088

Subject:Software engineering

Abstract/Summary:

With the explosive growth of data, dataset sizes have increased dramatically, and extracting valuable information from such large volumes of data is a major challenge. Traditional outlier detection techniques were designed for small datasets, so their shortcomings become apparent when they are applied to massive data. This study investigates outlier detection methods suited to the characteristics of large datasets, with the aim of improving both precision and efficiency on large-scale data. Following a detailed analysis of the characteristics and performance of traditional outlier detection algorithms and an in-depth study of distributed computing and related technologies, the clustering-based approach is selected from among the various detection approaches and combined with distributed computing to detect outliers in large datasets. First, the K-medoids clustering algorithm is analyzed and its complex procedure is simplified. A sort-based index matrix is proposed to narrow the search region for center points, reducing the algorithm's complexity. Based on the characteristics of the clustering results, a new distance-based outlier detection algorithm is presented, and its design and processing steps are explained in detail. Experiments on real datasets demonstrate the effectiveness of the proposed algorithm. Next, the structure of the proposed outlier detection algorithm is analyzed to assess the feasibility of parallelizing it. Taking the characteristics of the MapReduce parallel architecture into account, a general scheme for parallelizing the outlier detection algorithm with MapReduce is proposed, and concrete implementation steps for the parallel algorithm are designed according to the parallelization principle and the algorithm's process characteristics. Finally, a Hadoop distributed platform is built to test the algorithm's performance. In the experiments, the efficiency and stability of the parallel algorithm improve significantly, showing that the improvements to the algorithm and its parallel design are effective.
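
The clustering-then-distance idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the thesis's algorithm: the abstract does not specify the sort-based index matrix or the exact distance criterion, so the sketch assumes a plain alternating K-medoids loop and flags points whose distance to their medoid exceeds the mean plus z population standard deviations. The function names (assign, k_medoids, detect_outliers), the parameter z, and the sample data are all hypothetical.

```python
import math
import statistics


def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: math.dist(p, points[m])) for p in points]


def k_medoids(points, k, max_iter=100):
    """Plain alternating K-medoids (assumed variant, no index-matrix speedup)."""
    # Assumption: use the first k points as initial medoids (deterministic).
    medoids = list(range(k))
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for m in medoids:
            members = [i for i, lab in enumerate(labels) if lab == m]
            if not members:
                # Keep the old medoid if its cluster is empty.
                new_medoids.append(m)
                continue
            # The new medoid is the member with the smallest total
            # distance to the other members of its cluster.
            best = min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


def detect_outliers(points, k, z=2.0):
    """Flag points unusually far from their medoid (assumed criterion)."""
    _, labels = k_medoids(points, k)
    dists = [math.dist(p, points[m]) for p, m in zip(points, labels)]
    threshold = statistics.mean(dists) + z * statistics.pstdev(dists)
    return [i for i, d in enumerate(dists) if d > threshold]


# Hypothetical data: two tight groups plus one isolated point (index 8).
sample = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (30, 30),
]
print(detect_outliers(sample, k=2))
```

On this sample the isolated point at index 8 is the only one flagged. In a MapReduce setting, the per-point distance computation in assign is the natural candidate for the map phase, with medoid updates per cluster in the reduce phase; that mapping is likewise an assumption rather than the design documented in the thesis.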

Keywords/Search Tags:

large-scale data, outlier detection, distributed computing, Hadoop

Related items

1	Research And Implementation Of Integration Of R Language And Hadoop
2	Research Of Outlier Detection Algorithm Based On Hadoop
3	Research On Parallel Outlier Detection Method In Heterogeneous Distributed Environment
4	Research On Local Outlier Detection Algorithm
5	Research On Distributed SVM Algorithm Based On Hadoop Platform
6	Research On Distributed-Memory Ray Tracing For Large-Scale Rendering
7	Hadoop Based Parallel Genetic Algorithm For Large Scale VRPTW
8	Research On The Key Technology Of Processing Large Data Based On Hadoop
9	The Research On Distributed Task Scheduling Algorithms Based On Hadoop Platform
10	Distributed Computing For Large-scale Data With Group Dantzig Selector