
Research on Missing Data Recovery in Large-Scale, Sparse Datacenter Traces

Posted on: 2020-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: L F Bi
GTID: 2428330623956502
Subject: Computer Science and Technology
Abstract/Summary:
Trace analysis of datacenter clusters is important for datacenter performance optimization. However, because trace collection tasks are error-prone and run at low execution priority, modern datacenter traces suffer from serious missing-data problems. Previous work recovers trace data with statistical imputation methods, which are neither feasible nor accurate given two characteristics of missing data in datacenter traces: data sparsity and complex correlations among trace attributes. To this end, we propose STDR, a tensor-based trace data recovery model for efficient and accurate data recovery in large-scale, sparse datacenter traces. The proposed model recovers the missing data using tensor analysis theory. In addition, an attribute selection method and a data discretization optimization method are employed to improve accuracy and reduce computational complexity. The main contributions of the thesis are as follows.

(1) A framework for recovering missing data in datacenter traces. The thesis analyzes the characteristics of missing data in a representative Alibaba datacenter trace. Based on the results of this analysis, the proposed framework consists of two main phases. First, the data discretization and attribute selection methods work together to select the trace attributes that are strongly correlated with the value-missing attribute. Then, based on the selected attributes, a tensor is constructed and the missing values are recovered with CANDECOMP/PARAFAC and Tucker decomposition-based tensor completion methods.

(2) An Adjusting Mutual Information-based attribute selection method and an equal-width-binning-based data discretization method. The attribute selection method selects the attributes that are strongly correlated with the value-missing attribute while taking the redundancy among the selected attributes into account, so as to improve the accuracy of data recovery. The data discretization method uses different granularity search steps to meet the different requirements of attribute selection and tensor completion, which reduces computational complexity.

(3) A tensor-based data recovery method. A higher-order tensor is proposed to model the relationships between the selected attributes and the attribute with missing data. Then, the CANDECOMP/PARAFAC and the Tucker decomposition-based tensor completion methods are each applied to recover the data. For the Tucker decomposition-based method, we establish an auxiliary matrix for each dimension that models the relationship between the selected attributes and the other complete attributes. The auxiliary matrices are decomposed jointly with the tensor to improve the accuracy of data recovery.

(4) Performance evaluation of STDR on the Alibaba trace. Six baseline data recovery methods are used for the evaluation. The experimental results show that, compared with the two statistical data recovery methods, the three machine learning-based data recovery methods, and the genetic algorithm-based data recovery method, STDR reduces the mean relative error by 81.3%, 45.7%, and 47.3%, respectively. Furthermore, the trace recovered by STDR is analyzed and several new findings are obtained. The conclusions of existing analyses based on the incomplete Alibaba datacenter trace are complemented and revised.
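To make the idea of decomposition-based tensor completion concrete, the following is a minimal sketch of CANDECOMP/PARAFAC completion on a 3-way tensor, fitted by alternating ridge-regularized least squares over the observed entries only. It is an illustration under stated assumptions, not the thesis's implementation: the function names `cp_complete` and `mean_relative_error`, the rank, the regularization weight, and the synthetic data are all hypothetical, and the Tucker variant with auxiliary matrices is not covered.

```python
import numpy as np


def cp_complete(tensor, mask, rank=2, n_iters=100, ridge=1e-3, seed=0):
    """Fill missing entries of a 3-way tensor with a rank-`rank` CP model.

    tensor: float array; entries where mask is False are ignored (may be NaN).
    mask:   boolean array of the same shape, True where a value was observed.
    All defaults are illustrative assumptions, not values from the thesis.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, rank)) for n in tensor.shape]
    reg = ridge * np.eye(rank)

    for _ in range(n_iters):
        for mode in range(3):
            # Bring the current mode to the front; the other two keep their order.
            t = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
            m = np.moveaxis(mask, mode, 0).reshape(tensor.shape[mode], -1)
            b, c = [factors[k] for k in range(3) if k != mode]
            # Row (j, k) of the design matrix is the elementwise product b[j] * c[k].
            design = np.einsum("jr,kr->jkr", b, c).reshape(-1, rank)
            for i in range(t.shape[0]):
                x = design[m[i]]  # rows for the observed entries of slice i
                y = t[i, m[i]]    # the observed values themselves
                factors[mode][i] = np.linalg.solve(x.T @ x + reg, x.T @ y)

    estimate = np.einsum("ir,jr,kr->ijk", *factors)
    # Keep observed values untouched; use the model only where data is missing.
    return np.where(mask, tensor, estimate)


def mean_relative_error(estimate, truth, where):
    """Mean of |estimate - truth| / |truth| over the entries selected by `where`."""
    diff = np.abs(estimate[where] - truth[where])
    return float(np.mean(diff / np.abs(truth[where])))


if __name__ == "__main__":
    # Synthetic rank-2 tensor with strictly positive entries (assumed test data).
    rng = np.random.default_rng(1)
    a, b, c = (rng.uniform(0.5, 1.5, size=(n, 2)) for n in (10, 9, 8))
    truth = np.einsum("ir,jr,kr->ijk", a, b, c)
    mask = rng.random(truth.shape) < 0.6          # roughly 60% observed
    observed = np.where(mask, truth, np.nan)
    recovered = cp_complete(observed, mask)
    print("MRE on missing entries:", mean_relative_error(recovered, truth, ~mask))
```

In a trace setting, each tensor axis would correspond to one discretized attribute and each cell would hold a value of the attribute being recovered. That mapping is a reading of the abstract and is not a detail it confirms.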
Keywords/Search Tags: datacenter trace, data recovery, tensor, attribute selection, data discretization