| Errors are prevalent in time series data, such as obvious deviations in GPS trajectories. In particular, industrial data often contain dirty or imprecise values. Taking the turbine data collected by a certain corporation as an example, there are a large number of missing values, abnormal values, and mismatched values whose timestamps cannot be aligned. Each day, 24% of the data points (about 8 million) and 31% of the equipment (approximately 5,000 units) in the wind farm of a certain area cannot be stored in the database due to data errors, resulting in serious losses of data assets. Faced with these erroneous time series, in addition to keeping the dirty values, discarding all the errors, or inspecting them manually, two types of cleaning algorithms widely used in databases can be applied to clean time series data automatically, i.e., smoothing-based algorithms and model-based algorithms. However, both perform poorly on large spike errors, small errors, and consecutive errors, the three most common error types. Smoothing-based algorithms may modify almost all the data values, while model-based algorithms cannot build an accurate model of dynamic time series data. To improve the quality of time series data, we propose three types of algorithms. Our main contents and contributions are summarized as follows: · For large spike errors, we propose a cleaning algorithm under a speed constraint. Utilizing the novel speed constraint, we devise a polynomial-time algorithm for the global optimum and a linear-time algorithm for the local optimum under a highly efficient Median Principle. In addition, these algorithms support out-of-order arrivals of data points and adaptive window sizes. · For small errors, which cannot be identified by the aforesaid algorithm, we propose a statistics-based cleaning algorithm. Rather than using the minimum principle in data cleaning, we model the probability distribution of speed changes to solve the cleaning problem on small errors. We
devise exact algorithms and propose several approximate/heuristic methods to trade off effectiveness and efficiency. We also analyze the cases in which such algorithms are appropriate. · For consecutive errors, we propose a cleaning method based on labelling. The proposed iterative minimum repairing (IMR) can effectively clean time series data with a small number of labelled data (say, around 10%). Explicit analysis of convergence and efficient estimation of parameters in each iteration are provided. With incremental computation, we reduce the complexity of parameter estimation from O(n) to O(1). Experimental results on real business scenarios show that the above three cleaning methods can clean time series data effectively and efficiently. After cleaning the time series data, the prediction error of the average removal rate of wafers can be significantly reduced. |
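
To make the idea of a speed constraint concrete, the following is a minimal sketch and not the algorithm proposed in this work. It assumes that the speed between two consecutive points, (x_j - x_i) / (t_j - t_i), must lie in a range [s_min, s_max], that timestamps are distinct, and that the first point is trusted. Each later value is clamped into the range allowed by its already repaired predecessor, which equals the median of the lower bound, the observed value, and the upper bound. The function name `clamp_repair` and its parameters are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def clamp_repair(points: List[Point], s_min: float, s_max: float) -> List[Point]:
    """Greedy illustration of repairing spikes under a speed constraint.

    Assumptions (not from the source): the first point is correct and
    timestamps are distinct. Each later value is clamped into the interval
    reachable from the previously repaired point.
    """
    if not points:
        return []
    pts = sorted(points)  # tolerate out-of-order arrivals by sorting on time
    repaired = [pts[0]]
    for t, x in pts[1:]:
        t_prev, x_prev = repaired[-1]
        dt = t - t_prev
        lo = x_prev + s_min * dt  # smallest value allowed at time t
        hi = x_prev + s_max * dt  # largest value allowed at time t
        # Clamping is the median of (lo, x, hi) whenever lo <= hi.
        repaired.append((t, min(max(x, lo), hi)))
    return repaired


if __name__ == "__main__":
    series = [(1, 10.0), (2, 10.5), (3, 50.0), (4, 11.5)]
    print(clamp_repair(series, s_min=-1.0, s_max=1.0))
```

In this example the spike at timestamp 3 is pulled back to the largest value reachable from its predecessor, while the points that already satisfy the constraint are left unchanged. A greedy pass of this kind trusts every earlier repair, so it is only an illustration of the constraint itself, not of the global or local optimum discussed above.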