An Improved Clustering Algorithm For Large-scale Time Series Data

Posted on: 2018-11-10
Degree: Master
Type: Thesis
Country: China
Candidate: R H Du
Full Text: PDF
GTID: 2348330512476867
Subject: Communication and Information System
Abstract/Summary:
The security of temporal data has drawn substantial interest owing to the proliferation and ubiquity of time series in many fields. In anomaly detection systems for time-related data, time series clustering is one of the most popular mining methods. However, many time series clustering algorithms detect clusters in a batch fashion, which consumes a large amount of memory and thus limits their scalability to large time series. To solve this problem, this thesis proposes a time series clustering method, the Ex-BIRCH algorithm, which is based on the BIRCH algorithm, to mine the information implied in large time series accurately. The work is partly supported by the National Natural Science Foundation of China (No. 61172072, 61271308), the Beijing Natural Science Foundation (No. 4112045), and the Research Fund for the Doctoral Program of Higher Education of China (No. 20100009110002).

The main work of this thesis is as follows. First, the thesis compares existing clustering algorithms and points out the challenges of clustering large time series, and then analyzes the advantages of the BIRCH algorithm in processing large-scale data. On this basis, an improved clustering algorithm for time series is proposed, with the following concrete improvements:

(1) The distance metric in the BIRCH algorithm is replaced. Because Euclidean distance cannot measure the distance between time series accurately, the thesis adopts dynamic time warping (DTW) as the time series distance metric to achieve accurate clustering of time series.

(2) The cluster centroid calculation in the BIRCH algorithm is changed. The thesis proposes the Ad-DBA algorithm, based on the DTW barycenter averaging (DBA) algorithm. Ad-DBA can compute the mean of time series in a data stream environment, and Ex-BIRCH uses it to calculate cluster centroids.

(3) The cluster features in the BIRCH algorithm are modified. Changing the distance measure and the averaging method invalidates the original feature vector of the BIRCH algorithm. By analyzing the calculation process of the DTW and Ad-DBA algorithms, a new clustering feature is proposed to replace the original one.

To demonstrate the effectiveness of the proposed algorithm, the thesis conducts an extensive evaluation of Ex-BIRCH against BIRCH, k-means, and their variants combined with competitive distance measures. Experimental results show that Ex-BIRCH improves accuracy significantly compared with BIRCH and its variants, and achieves accuracy comparable to k-means and k-DBA. Unlike k-means and k-DBA, however, Ex-BIRCH retains the ability to handle continuously arriving data objects incrementally.

Finally, Ex-BIRCH is applied, with the help of a sliding window, to a subsequence time series clustering task on a simulated multivariate time series dataset. The results show that the improved algorithm can complete sequential pattern mining in a data stream environment.
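The DTW distance adopted in item (1) can be illustrated with a minimal sketch. This is the textbook dynamic-programming formulation, not the thesis's implementation: the function name `dtw_distance`, the squared pointwise cost, and the final square root are assumptions made for illustration, and the abstract does not say which DTW variant or constraint window the thesis uses.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences (illustrative sketch).

    Assumptions: squared pointwise cost, no warping-window constraint,
    and the square root of the accumulated cost as the returned distance.
    """
    n, m = len(a), len(b)
    # prev[j] holds the best accumulated cost of aligning a[:i-1] with b[:j].
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


# Two series with the same shape but different lengths align at zero cost.
print(dtw_distance([0, 1, 2], [0, 0, 1, 2]))
```

Because the warping path may repeat elements, series of different lengths or with shifted features can still be compared, which a pointwise Euclidean distance does not allow.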
Keywords/Search Tags:Time Series, Data Stream, Clustering, Sequence Pattern Mining