
Research On The Key Technology Of Massive Time Series Processing

Posted on: 2018-07-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Liu
Full Text: PDF
GTID: 1310330512967545
Subject: Computer application technology
Abstract/Summary:
In recent years, the rapid development of new technologies such as sensor networks, the Internet of Things, cloud data centers and the mobile Internet has led to explosive growth of time series data. Compared with traditional data, time series data has distinctive characteristics. First, the data is massive in size and is emitted continuously; high-frequency streams, very long sequences and sheer volume lead to long response times on the processing platform. Second, these scenarios require monitoring a large number of performance indicators; because the data is high-dimensional and diverse, both processing efficiency and index precision need continuous improvement. Many problems remain to be solved in time series processing platforms. For three typical data processing modes (batch processing, online analytical processing and real-time processing), we study similarity join over massive time series data based on MapReduce, correlation coefficient estimation over massive time series data based on HBase, and multiple continuous queries over data streams; for Hadoop clusters, we study the flow scheduling problem in the shuffle stage. The main work and contributions are as follows.

(i) We address the problem of scaling up similarity join for general metric distance functions using MapReduce. First, we propose a novel index structure, the Similarity Join Tree (SJT), which partitions data according to the underlying data distribution and distributes similar records to the same group. Different from existing approaches, SJT can prune a large number of comparisons within Reduce tasks by utilizing the by-products generated while partitioning the data. Then, to avoid straggler Reduce tasks, we design a graph partitioning algorithm that extends the well-known Fiduccia-Mattheyses algorithm and ensures load balancing while minimizing communication cost and redundancy in Reduce tasks. Experimental results on real data sets show that our approach is more effective and scalable than state-of-the-art algorithms.

(ii) To efficiently compute the correlation coefficient of long time series on HBase in real time, we first propose a fast estimation algorithm for the upper and lower bounds of the correlation coefficient, named DCE. To further reduce I/O cost, we extend DCE and propose the ADCE algorithm, which estimates the correlation coefficient quickly in an iterative manner. Experiments show that the proposed algorithms can quickly compute the correlation coefficient of long time series.

(iii) When there are a large number of aggregation queries, a data stream system may suffer from scalability problems. To address this, we propose collaborative aggregation, which promotes aggregate sharing among windows so that repeated aggregate operations can be avoided. Different from previous approaches, in which aggregate sharing is restricted by the window pace, we generalize aggregation over multiple values as a series of reductions, so the result produced by each reduction step can be shared. The sharing process is formalized in the feed semantics, and we present the compose-and-declare framework to determine the data sharing logic at very low cost. Experimental results show that our approach offers an order-of-magnitude performance improvement over state-of-the-art results and keeps a small memory footprint as the number of queries increases.

(iv) To improve the utilization of network resources in Hadoop clusters, we propose a job-priority-based flow scheduling approach that monitors network flow information in real time at the application layer. To ensure load balancing, two approaches (named Flow-based and Spray) using ECMP (Equal-Cost Multi-Path routing) are proposed for the Fat-Tree topology. Experimental results show that our scheduling approach improves job execution efficiency in the shuffle stage and, in particular, significantly reduces the network transmission time of the highest-priority job.
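Contribution (i) rests on routing similar records to the same Reduce task while replicating boundary records so no matching pair is missed. The abstract does not detail the SJT structure or the Fiduccia-Mattheyses extension, so the sketch below shows only the general partition-then-join pattern, using simple pivot-based (generalized hyperplane) partitioning and hypothetical helper names, not the dissertation's actual index:

```python
import random
from collections import defaultdict

def metric_similarity_join(records, dist, eps, num_pivots=4, seed=0):
    """Pivot-based partition-then-join sketch (NOT the dissertation's SJT).

    "Map" step: assign each record to its nearest pivot; a record is also
    replicated to partition i when the hyperplane bound
        dist(r, p_i) - dist(r, p_home) <= 2 * eps
    says a match could live there.  "Reduce" step: brute-force join
    inside each partition; a shared set deduplicates replicated pairs.
    """
    random.seed(seed)
    pivots = random.sample(records, num_pivots)

    partitions = defaultdict(list)
    for r in records:
        d = [dist(r, p) for p in pivots]
        home = min(range(num_pivots), key=lambda i: d[i])
        for i in range(num_pivots):
            if i == home or d[i] - d[home] <= 2 * eps:
                partitions[i].append(r)

    result = set()
    for group in partitions.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                a, b = group[i], group[j]
                if a != b and dist(a, b) <= eps:
                    result.add(tuple(sorted((a, b))))
    return result
```

The replication test follows from the triangle inequality, so the join is complete for any metric; the real SJT additionally prunes comparisons inside each Reduce task using by-products of the partitioning, which this toy version does not attempt.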
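Contribution (ii) estimates upper and lower bounds of the correlation coefficient from coarse summaries instead of scanning full sequences. DCE's actual formulas are not given in this abstract; the following sketch, in the same spirit, bounds the Pearson coefficient from per-segment (count, sum, sum-of-squares) statistics, using Cauchy-Schwarz to bracket the unseen cross-product term:

```python
import math

def corr_bounds(seg_stats_x, seg_stats_y):
    """Bound the Pearson correlation of two aligned series given only
    per-segment (count, sum, sum of squares) summaries.

    Per segment, Cauchy-Schwarz gives
        |sum (x - mx)(y - my)| <= sqrt(ssd_x * ssd_y),
    so the segment's sum(x*y) lies in
        [n*mx*my - sqrt(ssd_x*ssd_y), n*mx*my + sqrt(ssd_x*ssd_y)].
    The denominator of r needs only exact global sums, so it is exact;
    only the cross term is bracketed.
    """
    n = sx = sy = sxx = syy = 0.0
    lo_xy = hi_xy = 0.0
    for (nk, sxk, sxxk), (_, syk, syyk) in zip(seg_stats_x, seg_stats_y):
        n += nk; sx += sxk; sy += syk; sxx += sxxk; syy += syyk
        mx, my = sxk / nk, syk / nk
        ssdx = max(sxxk - nk * mx * mx, 0.0)
        ssdy = max(syyk - nk * my * my, 0.0)
        slack = math.sqrt(ssdx * ssdy)
        lo_xy += nk * mx * my - slack
        hi_xy += nk * mx * my + slack
    denom = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return (max((n * lo_xy - sx * sy) / denom, -1.0),
            min((n * hi_xy - sx * sy) / denom, 1.0))
```

An ADCE-style iterative refinement would split segments (reading more data from HBase each round) until the gap between the bounds is small enough to answer the query, which is why bounding cheaply first saves I/O.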
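Contribution (iii) hinges on treating window aggregation as a series of reductions whose intermediate results many queries can reuse. The feed semantics and compose-and-declare framework are not specified in this abstract; the sketch below illustrates only the underlying sharing idea with classic pane-based partial sums, where every stream element is reduced once no matter how many windows cover it:

```python
from collections import deque

class SharedSumAggregator:
    """Pane-based reduction sharing (an illustration of the general
    idea, not the dissertation's feed semantics).

    The stream is cut into fixed-size panes; each pane's partial sum is
    computed exactly once and sealed.  Any number of sliding-sum queries
    then compose their answers from the shared sealed partials instead
    of re-aggregating raw elements.
    """
    def __init__(self, pane_size):
        self.pane_size = pane_size
        self.panes = deque()      # shared partial sums, one per sealed pane
        self.current = 0.0        # partial sum of the open pane
        self.count = 0

    def insert(self, value):
        self.current += value
        self.count += 1
        if self.count % self.pane_size == 0:
            self.panes.append(self.current)   # seal the pane once
            self.current = 0.0

    def window_sum(self, num_panes):
        """One query's answer: sum over its last num_panes sealed panes.
        Queries with different window lengths reuse the same partials."""
        return sum(list(self.panes)[-num_panes:])
```

In this toy version sharing is still tied to a common pane size; the abstract's point is precisely that collaborative aggregation removes that window-pace restriction by sharing at the granularity of individual reduction steps.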
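Contribution (iv) schedules shuffle flows by job priority while spreading load over equal-cost paths. The Flow-based and Spray variants are not described in this abstract, so the following is only a toy sketch of the combined idea, with an invented flow tuple format: urgent jobs' flows are placed first, and each flow goes to the currently least-loaded ECMP path:

```python
import heapq
import itertools

def schedule_shuffle_flows(flows, num_paths):
    """Toy priority-then-ECMP placement (not the dissertation's
    Flow-based or Spray algorithm).

    flows: iterable of (job_priority, flow_id, size_mb), where a lower
    priority number means a more urgent job.  Highest-priority flows are
    placed first; each flow is assigned to the equal-cost path with the
    least traffic assigned so far.
    """
    load = [0.0] * num_paths            # MB already assigned per path
    assignment = {}
    tie = itertools.count()             # keeps heap ordering stable
    heap = [(prio, next(tie), fid, size) for prio, fid, size in flows]
    heapq.heapify(heap)
    while heap:
        _prio, _, fid, size = heapq.heappop(heap)
        path = min(range(num_paths), key=load.__getitem__)
        load[path] += size
        assignment[fid] = path
    return assignment, load
```

A real deployment would make these decisions online from application-layer flow monitoring, as the abstract describes, rather than from a static flow list.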
Keywords/Search Tags: Massive Time Series Processing, Similarity Join, Stream Aggregation, Correlation Coefficient, Network Scheduling