Font Size: a A A

The Research On Time Series And Cluster Analysis

Posted on:2007-04-16Degree:DoctorType:Dissertation
Country:ChinaCandidate:B LiuFull Text:PDF
GTID:1118360212484766Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Recently, data mining and its applications have already come into many disciplines and achieved plentiful fruits in diversified fields, including artificial intelligence and machine learning, database, pattern recognition, bioinformatics, neural computing, and so on. It not only appeals scientists but also catches the attention from governments and industries. The governments, industrial communities, and academic fields are so keen on mastering data mining techniques that they have invested a large deal of money and energy on the corresponding research. Therefore, the progress of data mining will promote the development of science and society.Data mining have many research fields. In this dissertation, we focus on time series and cluster technology. The main works and contributions of the dissertation are summarized as follows:(1) Wavelet transforms are used as a dimensionality reduction technique to permit efficient similarity search over high-dimensional time series data. This dissertation proposes the tight upper and lower bounds on the estimation distance using wavelet transform, and we show that the traditional distance estimation is only part of our lower bound. According to the lower bound, we can exclude more dissimilar time series than traditional method. And according to the upper bound, we can directly judge whether two time series are similar, and further reduce the number of time series to process in original time domain. The experiments have shown that using the upper and lower tight bounds can significantly improve filter efficiency and reduce running time than traditional method.(2) For wavelet transform used in dimensional reduction, the traditional algorithms use the first k wavelet coefficients as an approximation of the original time series data set. But it is possible that choosing first k coefficients is not optimal, and perhaps choosing other k coefficients is better than the first k. A new method is proposed to better approximating the original time series data set. The main idea of the method is to choose the same k columns of the wavelet time series data set which have the maximum square sum. The experiments have shown that our method can reduce the relative error compared to traditional algorithms.(3) Similarity search for time series under dynamic time shifting is prevailing. But most recent research focused on the full length similarity match of two time series.A new basic subsequence similarity search algorithm based on dynamic programming is proposed. For a given query time series, the algorithm can find out the most similar subsequence in a long time series. Furthermore two improved algorithms are also given. They can reduce the computation amount of the distance matrix for subsequence similarity search. Experiments on real and synthetic data sets show that the improved algorithms can significantly reduce the computation amount and running time compared to the basic algorithm.(4) Similarity search in many new database applications can generally be referred as similarity search in metric space. Time series can be seen as one application in metric space. A new index construction algorithm is proposed for similarity search in metric space. The new data structure, called bu-tree (bottom-up tree), is based on constructing the index tree from bottom-up, rather than the traditional top-down approaches. The construction algorithm of bu-tree and the range search algorithm based on it are given. And the update to bu-tree is also discussed. The experiments show that bu-tree is better than sa-tree in search efficiency, especially when the objects are not uniform distributed or the query has low selectivity.(5) In many real world applications, with the databases frequent insertions and deletions, the ability of a data mining technique to detect and react quickly to dynamic changes in the data distribution and clustering over time is highly desired. Data summarizations (e.g., data bubbles) have been proposed to compress large databases into representative points suitable for subsequent hierarchical cluster analysis. We thoroughly investigate the quality measure (data summarization index) of incremental data bubbles. When updating databases, we show which factors could affect the mean and standard deviation of data summarization index or not. Based on these statements, a fully dynamic scheme to maintain data bubbles incrementally is proposed. An extensive experimental evaluation confirms our statements and shows that the fully dynamic incremental data bubbles are effective in preserving the quality of the data summarization for hierarchical clustering.(6) Probabilistic diagnosis algorithm is very important in the system level fault diagnosis research. Based on cluster theory, we propose four probabilistic algorithms using greedy algorithm for system-level fault diagnosis. By computer simulation, it is shown that these diagnosis algorithms can achieve a high probability of correct under low time complexity. The greedy algorithm one has the best performance in the fourprobabilistic algorithms. The results also indicate that our algorithms have better performance than the Compete algorithm and Majority algorithm, which are classic probabilistic algorithms in system level fault diagnosis.
Keywords/Search Tags:Data Mining, Time Series, Similarity Search, Wavelet Transform, Metric Space, Clustering
PDF Full Text Request
Related items