
Research On Outlier Detection Algorithm And Its Application For Large-scale Datasets

Posted on: 2016-10-13  Degree: Master  Type: Thesis
Country: China  Candidate: B Jiang  Full Text: PDF
GTID: 2348330542476088  Subject: Software engineering
Abstract/Summary:
With the explosive growth of data, dataset sizes have increased dramatically, and extracting valuable information from such large volumes is a major challenge. Traditional outlier detection techniques were designed for small datasets, and their shortcomings become apparent on massive data. This study targets the characteristics of large datasets: it investigates outlier detection methods and addresses the problems of precision and efficiency on large-scale data. After a detailed analysis of the characteristics and performance of traditional outlier detection algorithms and an in-depth study of distributed computing and related technologies, clustering-based detection is selected from among the various detection approaches and combined with distributed computing for outlier detection on large datasets. First, the K-medoids clustering algorithm is analyzed and its complex procedure is simplified. A sort-based index matrix is proposed to narrow the search region for center points, reducing the algorithm's complexity. Based on the characteristics of the clustering results, a new distance-based outlier detection algorithm is presented, and its design and processing steps are explained in detail. Experiments on real datasets demonstrate the effectiveness of the proposed algorithm. Next, the structure of the proposed outlier detection algorithm is analyzed to assess the feasibility of parallelizing it. Taking into account the characteristics of the MapReduce parallel architecture, a general scheme for parallelizing the outlier detection algorithm with MapReduce is proposed, and concrete implementation steps are designed according to the parallelization principle and the algorithm's process characteristics. Finally, a Hadoop distributed platform is built to test the algorithm's performance. In the experiments, the efficiency and stability of the parallel algorithm improve significantly, showing that the improvements and the parallel design are effective.
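To make the clustering-then-distance idea concrete, the following is a minimal single-machine sketch, not the thesis's algorithm. It runs a plain K-medoids loop and then flags points that lie unusually far from their own medoid. The function names (`assign`, `k_medoids`, `detect_outliers`), the median-based cutoff, and the `factor` parameter are all assumptions made for illustration; the sort-based index matrix and the MapReduce implementation described in the abstract are not reproduced here.

```python
import math
import statistics


def assign(points, medoids):
    """Group point indices by the index of their nearest medoid."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points, initial_medoids, max_iter=100):
    """Plain K-medoids: alternate assignment and medoid update until stable."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        new_medoids = []
        for m in medoids:
            members = clusters[m]
            if not members:  # keep a medoid that lost all its points
                new_medoids.append(m)
                continue
            # The new medoid is the member with the smallest total
            # distance to the other members of its cluster.
            best = min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


def detect_outliers(points, clusters, factor=3.0):
    """Flag points farther from their medoid than factor * cluster median.

    The median-based cutoff is an illustrative assumption, not the
    thesis's criterion.
    """
    outliers = []
    for m, members in clusters.items():
        if not members:
            continue
        dists = {i: math.dist(points[i], points[m]) for i in members}
        cutoff = factor * statistics.median(dists.values())
        outliers.extend(i for i, d in dists.items() if d > cutoff)
    return sorted(outliers)


# Hypothetical toy data: two tight groups and one stray point (index 10).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
    (5, -6),
]
medoids, clusters = k_medoids(points, initial_medoids=[0, 5])
print(medoids)                            # [4, 9]
print(detect_outliers(points, clusters))  # [10]
```

In this sketch the assignment step is independent per point and the thresholding is independent per cluster, which is the kind of structure a map phase and a reduce phase could exploit. How the thesis actually partitions the work is not stated in the abstract.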
Keywords/Search Tags:large-scale data, outlier detection, distributed computing, Hadoop