Research And Implementation On Key Technologies Of Big Data Cleaning For EMU

Posted on:2016-09-10

Degree:Master

Type:Thesis

Country:China

Candidate:W M Yan

GTID:2308330467472738

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, with the large-scale deployment of China’s high-speed EMUs (electric multiple units), the volume of EMU monitoring data has grown explosively. The working-status data of major EMU components such as brake pads, wheels, and bogies is the basis for EMU fault diagnosis, life prediction, and fault knowledge reasoning, and it plays an important role in the development of information technology across the railway system. However, data quality has not received the attention it deserves. Inspection of the monitored EMU data reveals incomplete data, redundant information, invalid data, and other quality problems, which cause analyses based on this data to produce erroneous results and degrade the quality of information services. EMU data cleaning therefore has both theoretical and practical significance.

This thesis mainly studies the illegal data contained in EMU data, namely outliers. Because traditional cleaning methods often perform poorly on large data sets, the thesis introduces the Hadoop distributed computing framework, whose Map/Reduce programming model combines well with the proposed algorithm. Because EMU data is large, multi-dimensional, and of diverse types, the thesis puts forward a grid-based LOF (local outlier factor) outlier detection algorithm. Since most data points are not outliers, it is unnecessary to run detection over the entire data set. The grid-based LOF algorithm first removes the parts of the data set that do not contain outliers (grid pruning) and then runs detection on the remaining data, which greatly reduces the time complexity.

The thesis proposes two improvements to the grid-based LOF outlier detection algorithm. First, because the definition of grid density in the pruning phase is not rigorous, it introduces the concept of a clustering radius; this avoids missing outliers during pruning and so greatly improves detection accuracy. Second, because the LOF algorithm does not combine well with the MapReduce programming model, it introduces the concept of a grid number: using the grid number, the entire data set is divided into a number of small, labeled data sets, so that the MapReduce-based LOF algorithm can process the data quickly in parallel.

The experimental results show that the improved grid-based LOF outlier detection algorithm achieves better outlier detection accuracy and time efficiency. The experiments also verify that a Hadoop cluster has an advantage when processing large data sets. In conclusion, the results of this thesis provide a reference for EMU big data cleaning.
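The general idea of grid pruning followed by LOF scoring can be illustrated with a minimal single-machine sketch. This is not the thesis's implementation: the function names, the fixed cell size, and the simple count-based density threshold are assumptions made for illustration, the clustering-radius and grid-number refinements are not modeled, and no Hadoop or MapReduce code is involved. LOF is computed with exactly k nearest neighbors (ties beyond k are ignored), and points are assumed to be distinct.

```python
import math
from collections import defaultdict


def cell_of(point, cell_size):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(math.floor(x / cell_size) for x in point)


def prune_dense_cells(points, cell_size, min_count):
    """Return indices of points that survive pruning.

    Assumption for this sketch: a cell holding at least `min_count`
    points is treated as dense and all of its points are pruned.
    """
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[cell_of(p, cell_size)].append(i)
    candidates = []
    for members in cells.values():
        if len(members) < min_count:
            candidates.extend(members)
    return sorted(candidates)


def k_nearest(points, i, k):
    """Return the k nearest neighbours of point i as (distance, index)."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def lof_scores(points, candidates, k):
    """Compute LOF only for the candidate indices.

    Neighbours are still searched in the full data set, so pruning
    changes which points are scored, not the scores themselves.
    """
    cache = {}

    def knn(i):
        if i not in cache:
            cache[i] = k_nearest(points, i, k)
        return cache[i]

    def k_distance(i):
        return knn(i)[-1][0]

    def lrd(i):
        # Local reachability density: inverse of the mean reachability
        # distance from i to its k nearest neighbours.
        reach = [max(k_distance(j), d) for d, j in knn(i)]
        return len(reach) / sum(reach)

    return {
        i: sum(lrd(j) for _, j in knn(i)) / (k * lrd(i)) for i in candidates
    }
```

With a tight cluster and one isolated point, `prune_dense_cells` should discard the cluster and leave only the isolated point as a candidate, and `lof_scores` should give that point a value well above 1, while points inside the cluster score close to 1 if they are evaluated.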

Keywords/Search Tags:

EMU, Big Data, Outliers, Grid pruning, LOF, Hadoop

Related items

1	Research On The Outliers Detection Algorithm
2	Online detection of outliers for data streams
3	Strategic targeting of outliers for expert review
4	The Design And Realization Of The Data Quality Management Platform Of Grid Assets System Based On Hadoop
5	Research And Application Of Grid Enterprise Big Data Platform
6	Research On Parallel Acceleration Algorithm Of Association Rules Based On Hadoop
7	Research On Density-Based Outlier Detection Over Uncertain Data
8	Research On Outliers Detection In Data Stream Based On Unsupervised Learning
9	Kmeans Analysis Of Massive Book Circulation Data Based On Hadoop
10	Mining Association Rules Among Outliers Based On Histogram And FP-growth