Research And Implementation On Key Technologies Of Big Data Cleaning For EMU

Posted on: 2016-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: W M Yan
GTID: 2308330467472738
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the large-scale deployment of China's high-speed EMUs (electric multiple units), the volume of monitored EMU data has grown explosively. The working-status data of main EMU components such as brake pads, wheels, and bogies is the basis of EMU fault diagnosis, life prediction, and fault knowledge reasoning, and it plays an important role in the development of information technology across the railway. However, data quality has not received the attention it deserves. Inspection of the monitored EMU data reveals incomplete data, redundant information, invalid data, and other quality problems, which lead analyses based on EMU data to wrong results and degrade the quality of information services. EMU data cleaning therefore has both theoretical and practical significance.

This thesis mainly studies the illegal data contained in EMU data, namely outliers. Because traditional cleaning methods often perform poorly on large data sets, the thesis introduces the Hadoop distributed computing framework, whose Map/Reduce programming model combines well with the proposed algorithm. Because EMU data is large, multi-dimensional, and of diverse types, the thesis puts forward a grid-based LOF (Local Outlier Factor) outlier detection algorithm. Since most data points are not outliers, it is unnecessary to run detection on the entire data set: the grid-based algorithm first removes the part of the data set that contains no outliers (grid pruning) and then runs detection on the remaining data, which greatly reduces the time cost.

Two improvements to the grid-based LOF outlier detection algorithm are proposed. First, because the definition of grid density used in the pruning phase is not rigorous, the concept of a clustering radius is introduced; this avoids missing outliers and so greatly improves detection accuracy. Second, because the LOF algorithm does not combine well with the MapReduce programming model, the concept of a grid number is introduced; using the grid number, the entire data set is divided into a number of small labeled data sets, so that the MapReduce-based LOF algorithm can process the data quickly in parallel.

The experimental results show that the improved grid-based LOF outlier detection algorithm performs better in both detection accuracy and time efficiency. The experiments also verify that a Hadoop cluster is superior for handling large data sets. In conclusion, the results of this thesis provide a reference for EMU big data cleaning.
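The general idea of grid pruning followed by LOF can be illustrated with a small single-machine sketch. This is not the thesis's implementation: the abstract does not define the clustering radius or the grid number, so the sketch uses a plain point-count threshold per cell and the textbook LOF definition. All names (`grid_id`, `partition_by_grid`, `prune_dense_cells`, `lof_scores`, `cell_size`, `min_pts`) are hypothetical, and the cell id is only assumed to play a role similar to a key in a map step.

```python
import math
from collections import defaultdict


def grid_id(point, cell_size):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(math.floor(x / cell_size) for x in point)


def partition_by_grid(points, cell_size):
    """Group point indices by cell id (assumed analogue of a map-side key)."""
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[grid_id(p, cell_size)].append(idx)
    return cells


def prune_dense_cells(cells, min_pts):
    """Keep only indices in sparse cells (fewer than min_pts points)."""
    return [i for members in cells.values()
            if len(members) < min_pts for i in members]


def lof_scores(points, candidates, k):
    """Textbook LOF for candidate indices; neighbours come from all points.

    Assumes len(points) > k.
    """
    knn_cache, lrd_cache = {}, {}

    def knn(i):
        if i not in knn_cache:
            dists = sorted((math.dist(points[i], q), j)
                           for j, q in enumerate(points) if j != i)
            k_dist = dists[k - 1][0]
            # Include ties at the k-distance, as in the LOF definition.
            knn_cache[i] = (k_dist, [(d, j) for d, j in dists if d <= k_dist])
        return knn_cache[i]

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        if i not in lrd_cache:
            _, neigh = knn(i)
            reach = [max(knn(j)[0], d) for d, j in neigh]
            lrd_cache[i] = len(reach) / max(sum(reach), 1e-12)
        return lrd_cache[i]

    scores = {}
    for i in candidates:
        _, neigh = knn(i)
        scores[i] = sum(lrd(j) for _, j in neigh) / (len(neigh) * lrd(i))
    return scores


# Toy data: a tight 5x5 cluster plus one far-away point.
points = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)] + [(5.0, 5.0)]
cells = partition_by_grid(points, cell_size=1.0)
candidates = prune_dense_cells(cells, min_pts=3)
scores = lof_scores(points, candidates, k=3)
```

Here only the points left after pruning are scored, and a score well above 1 marks a point as a likely outlier. A count-only threshold like this can drop an outlier that happens to fall in a dense cell, which is the kind of missed detection the clustering-radius improvement is said to address.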
Keywords/Search Tags: EMU, Big Data, Outliers, Grid pruning, LOF, Hadoop