Research On Time Series Data Cleaning

Posted on:2019-07-22

Degree:Doctor

Type:Dissertation

Country:China

Candidate:A Q Zhang


GTID:1368330590951539

Subject:Software engineering

Abstract/Summary:


Errors are prevalent in time series data, such as obvious deviations in GPS trajectories. In particular, industrial data often contain dirty or imprecise values. Taking the turbine data collected from a cooperating company as an example, there are large numbers of missing values, abnormal values, and mismatched values whose timestamps cannot be aligned. Each day, in the wind field of one area, 24% of data points (about 8 million) and 31% of equipment (approximately 5,000 units) cannot be stored in the database due to data errors, resulting in serious losses of data assets.

Faced with these erroneous time series, in addition to keeping the dirty values, discarding all the errors, or inspecting the data manually, two types of cleaning algorithms that are widely used in databases can clean time series data automatically: smoothing-based algorithms and model-based algorithms. However, both perform poorly on large spike errors, small errors, and consecutive errors, the three most common error types. Smoothing-based algorithms may modify almost all the data values, while model-based algorithms cannot build an accurate model of dynamic time series data. To improve the quality of time series data, we propose three types of algorithms. Our main contents and contributions are summarized as follows:

· For large spike errors, we propose a cleaning algorithm under speed constraints. Utilizing the novel speed constraint, we devise a polynomial-time algorithm for the global optimum and a linear-time algorithm for the local optimum under a highly efficient Median Principle. In addition, these algorithms support out-of-order arrivals of data points and adaptive window sizes.

· For small errors, which cannot be identified by the aforesaid algorithm, we propose a statistics-based cleaning algorithm. Rather than using the minimum principle in data cleaning, we model the probability distribution of speed changes to solve the cleaning problem on small errors. We devise exact algorithms and propose several approximate and heuristic methods that trade off effectiveness and efficiency. We also analyze the cases in which such algorithms are appropriate.

· For consecutive errors, we propose a cleaning method based on labelling. The proposed iterative minimum repairing (IMR) can effectively clean time series data with a small amount of labelled data (say, around 10%). Explicit analysis of convergence and efficient estimation of parameters in each iteration are provided. With incremental computation, we reduce the complexity of parameter estimation from O(n) to O(1).

Experimental results on real business scenarios show that the above three cleaning methods can clean time series data effectively and efficiently. After cleaning the time series data, the prediction error of the average removal rate of wafers can be significantly reduced.
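To make the idea of a speed constraint concrete, the following is a minimal sketch, not the dissertation's algorithm. It assumes that a speed constraint bounds the rate of change between consecutive points to a range [s_min, s_max], that timestamps are strictly increasing, and that the first point is correct. The function names `violates_speed` and `greedy_speed_repair` and the parameters `s_min` and `s_max` are hypothetical. Each point is replaced by the median of the lowest allowed value, the observation, and the highest allowed value, which amounts to clamping the observation into the allowed range. Unlike the algorithms summarized above, this sketch does not look ahead within a window and does not handle out-of-order arrivals.

```python
from statistics import median


def violates_speed(t1, x1, t2, x2, s_min, s_max):
    """Return True if the speed between two points lies outside [s_min, s_max]."""
    speed = (x2 - x1) / (t2 - t1)
    return speed < s_min or speed > s_max


def greedy_speed_repair(times, values, s_min, s_max):
    """Repair values left to right so that every consecutive pair satisfies
    the assumed speed constraint. Illustrative simplification only."""
    if not values:
        return []
    repaired = [values[0]]  # assumption: the first observation is trusted
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]  # assumption: strictly increasing timestamps
        lower = repaired[-1] + s_min * dt  # lowest value reachable from previous repair
        upper = repaired[-1] + s_max * dt  # highest value reachable from previous repair
        # The median of (lower, observation, upper) clamps the observation into range.
        repaired.append(median([lower, values[i], upper]))
    return repaired


times = [1, 2, 3, 4, 5]
values = [10, 11, 30, 13, 14]  # the third value is a large spike
print(greedy_speed_repair(times, values, s_min=-2, s_max=2))
```

In this toy input the spike at the third point is pulled down to the largest value the constraint allows, while points that already satisfy the constraint are left unchanged.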

Keywords/Search Tags:

time series, data quality, data cleaning, data integration


Related items

1	Heterogeneous Data Sources Integration In Research And Application Of The Cleaning Strategy
2	Data Cleaning In Data Integration
3	Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform
4	Study On Water Quality Time Series Data Mining And Application Integration
5	Research On Data Cleaning Technology With The Design And Implementation Of Data Cleaning Framework
6	Research And Application Of Data Cleaning And Repairing Methods In Production Process
7	Data cleaning in the energy domain
8	Research On Key Technologies Of Time Series Cleaning
9	Research And Implementation Of Distribute Integration Tool Combining ETL With Data Cleaning
10	Time Series Data Mining Technology And Its Applied Research In The Prediction Of Water Quality