Research On Key Technologies Of Temporal Data Cleaning

Posted on: 2022-06-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X O Ding
Full Text: PDF
GTID: 1488306569485734
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, data is being produced and accumulated at an unprecedented speed. Across the data processing pipeline, namely acquisition, collection, storage, analysis, and application, increasing attention is paid to the time-related dimension of data, which mainly refers to its temporal order. Data with temporal order supports historical data modelling and the extraction of valuable information. However, the quality of temporal data faces multiple problems. Low-quality temporal data not only incurs large manpower and material costs in data preprocessing, but also leads to deviations and erroneous results in downstream applications and knowledge mining. Data cleaning is an effective way to improve temporal data quality. As the demand for high-quality data has grown stricter, research on temporal data cleaning faces many challenges, including: (1) the time (currency) orders of data may be lost, misplaced, or unavailable, which results in the loss of part of the temporal information of the data; (2) different types of erroneous data may affect each other during the repairing phase rather than being completely isolated, so it is challenging to repair complex incorrect data both effectively and efficiently; (3) the diversity of data also leads to varied patterns of erroneous data, and because the causes of errors and violations are complex, it is difficult to identify and explain the real underlying problems.

This thesis conducts a systematic study of temporal data cleaning, aiming to provide effective solutions to several key data cleaning problems in urgent demand. Both relational temporal data and multidimensional time series data are studied. The contributions of the thesis are summarized as follows.

Firstly, this thesis studies methods for determining currency orders when timestamps are incomplete. It develops an integrated currency determination method that computes the currency orders among tuples using currency constraints. It then studies the cleaning of multiple quality problems, namely incompleteness and inconsistency, with currency reasoning and determination. This thesis introduces a 4-step framework for error detection and quality improvement in incomplete and inconsistent data without timestamps. A currency-related consistency distance metric is defined to measure the similarity between dirty tuples and clean ones more accurately. In addition, currency orders are treated as an important feature in the training process for missing-value imputation. Thorough experiments on real-life datasets verify that the method outperforms existing advanced methods, especially on datasets with complex currency orders, and improves the performance of data repairing under multiple quality problems.

Secondly, this thesis proposes a correlation-analysis-based anomaly detection method for multidimensional time series data. It first computes correlation values among sequences after standardization, and then constructs a time series correlation graph model. Time series cliques are constructed according to the correlation degree in the graph. Anomaly detection is performed both within a clique and outside it. Experimental results on a real industrial sensor dataset show that the proposed method is effective for anomaly detection in high-dimensional time series data. Comparative experiments verify that the proposed method performs better than both the statistics-based and the machine-learning-based baseline methods. The method achieves reliable mining of correlation knowledge between time series, which not only saves time costs but also identifies abnormal patterns from complex conditions.

Thirdly, this thesis defines the inconsistent subsequences problem in multivariate time series and proposes an integrity data repair approach to resolve such inconsistencies. The proposed repairing method consists of two parts: designing an effective anomaly detection method to discover latent inconsistent subsequences in IoT time series; and developing repair algorithms to precisely locate the start and end times of inconsistent intervals. Thorough experiments on two real-life datasets verify the superiority of the proposed method over other practical approaches. Experimental results also show that the method captures and repairs inconsistency problems effectively in time series from complex industrial scenarios.

Finally, this thesis addresses the violation explanation problem for multivariate temporal data and proposes a 3-step self-contained method. Domain knowledge is formalized and used to identify violation events through the violation of constraints. This thesis proposes set-cover-based violation explanation algorithms to discover the events reflected by violation features, and further develops knowledge update algorithms to improve the original knowledge set. Experimental results verify that the proposed method computes high-quality explanations of violation data. Moreover, the update algorithms can effectively improve an existing incomplete knowledge set.

This thesis addresses several key problems of temporal data cleaning and proposes a complete model, method, and algorithm implementation for each research point. It covers the important steps of the data cleaning problem, namely detection, location, repairing, and explanation.
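
As an illustration of the correlation-graph idea in the second contribution (standardize, correlate, build a graph, form cliques, check within a clique), the following is a minimal Python sketch. It is not the thesis's algorithm. The function names, the correlation threshold of 0.6, the greedy clique partition, the median-based deviation rule, and the toy data are all assumptions made for illustration, and only positive correlation and within-clique checks are handled.

```python
from statistics import mean, median, pstdev


def standardize(series):
    """Z-score a series; a constant series maps to all zeros."""
    mu, sigma = mean(series), pstdev(series)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in series]


def correlation(a, b):
    """Pearson correlation of two standardized series of equal length."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def build_graph(z, threshold):
    """Connect series i and j when their correlation is >= threshold."""
    adj = {i: set() for i in range(len(z))}
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            if correlation(z[i], z[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def greedy_cliques(adj):
    """Greedily partition nodes into cliques (assumed heuristic, not maximum cliques)."""
    unassigned = set(adj)
    cliques = []
    for node in sorted(adj, key=lambda v: -len(adj[v])):
        if node not in unassigned:
            continue
        clique = [node]
        for cand in sorted(adj[node] & unassigned):
            # Keep a candidate only if it is adjacent to every current member.
            if all(cand in adj[member] for member in clique):
                clique.append(cand)
        unassigned -= set(clique)
        cliques.append(sorted(clique))
    return cliques


def within_clique_anomalies(z, cliques, tolerance):
    """Flag (series, time) pairs that deviate from the clique median by > tolerance."""
    flagged = []
    for clique in cliques:
        if len(clique) < 3:  # too few members for a meaningful median
            continue
        for t in range(len(z[clique[0]])):
            centre = median(z[i][t] for i in clique)
            for i in clique:
                if abs(z[i][t] - centre) > tolerance:
                    flagged.append((i, t))
    return sorted(flagged)


# Toy data (hypothetical): series 2 follows series 0 and 1 except for a spike at t=4.
series = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    [11, 12, 13, 14, 25, 16, 17, 18, 19, 20],
    [5, 1, 4, 2, 5, 1, 4, 2, 5, 1],
]
z = [standardize(s) for s in series]
cliques = greedy_cliques(build_graph(z, threshold=0.6))
anomalies = within_clique_anomalies(z, cliques, tolerance=1.5)
print(cliques, anomalies)
```

Grouping correlated series first means each point is compared only against series expected to move together, which is the intuition behind checking anomalies inside a clique. A real system would need a principled choice of threshold and tolerance.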
Keywords/Search Tags: temporal data cleaning, violation detection, anomaly detection, rule-based data repairing