Font Size: a A A

Similarity Search And Outlier Detection In Time Series

Posted on:2006-05-31Degree:DoctorType:Dissertation
Country:ChinaCandidate:H XiaoFull Text:PDF
GTID:1118360155460410Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
A time series is a data sequence of observations which are ordered in time(or space), which exists in various fields extensively, such as industry, economy, finance, science observing and social science, etc. How to manage and use these time series data efficiently is an interesting problem. Classical time series analysis always propose a hypothesis first, then prove its validity, which is not suitable for discovery task. Time series data mining can extract hidden and potentially useful knowledge from large amounts of data which maybe omitted by user, which attracts more and more attention.At present, main topics of time series data mining include similarity search, temporal pattern mining, classification and cluster and outlier detection. In this dissertation, similarity search and outlier detection are to be studied, which include several problems such as pattern representation of time series, similarity measure, index for similarity search, definition and detection of outlier in time series. The main works and contributions of this dissertation are:(1) Pattern Representation of Time SeriesDue to large amount and complex data character of time series, data mining directly on raw time series is time-consuming and inefficient. Sometimes, the accuracy and reliability of mining results will descend. Pattern representation of time series is abstract and summary of time series, also a high level feature description of time series. A new Piecewise Linear Representation (PLR) of time series is proposed. The method designs a kind of temporal edge operator and use it to find edge points of time series, choose some edge points as segmented points, then connect these segmented points with line segments, called TEO representation of time series. TEO representation is very simple and intuitionistic, it can compress time series much, and remove some noises in time series. Detailed experiments show that compared with serveral PLR representations, TEO representation is a more precise approximation to raw time series, and can be applied various fields with diverse data features.(2) Similarity MeasureEuclidean distance and Dynamic Time Warping (DTW) distance are two main similarity measures between time series. However, Euclidean distance is not suitable for time series when occurs linear drift or time warping, DTW distance cannot beadopted widely because of its square time complexity. A new similarity measure, called Dynamic Pattern Matching (DPM) distance is defined based on the pattern representation of time series. DPM distance can support all transformations of time series include time warping with approximate linear time complexity. Experiments on several ture datasets show that the accuracy of classification of Knn method based DPM distance exceeds based on Euclidean distance and DTW distance, the speed of classification is also improved above ten times.(3) Index for Time SeriesSpatial Access Method (SAM) often be used by similarity search over time series data, such as R-Tree, R*-Tree, X-Tree, etc. All these methods are designed for spatial objects; which only consider the position and shape of spatial objects, ignore order relations between them. In this dissertation, a new spatial index structure called the OR-Tree is proposed, which make some important modifications to classical R-Tree, keep order relations between spatial objects in the leaf nodes of the OR-Tree. This improvement reduces the numbers of access disk and false alarms greatly, and can be generalized to other SAMs. Experiments based on similarity search in time series data prove its high efficiency of the OR-Tree.(4) Similarity Search in Time SeriesSimilarity search based on DPM distance is not high efficient. A Lower Bounding measure (DPMLB) for DPM distance is introduced to improve the similarity search in time series. DPMLB measure compute very fast, need scan whole time series only once. Experiments show that the DPMLB measure can quickly filter above half of time series which is dissimilar with the query time series, and reduce the computation of DPM distance largely. The DPMLB measure also satisfies Lower Bounding Lemma, which ensures integrity of similarity result.(5) Outlier Detection in Time SeriesAt present, there is no definition of outlier of time series to be accepted by most of researchers. Three types of outlier of time series are concluded, and pattern outlier is discussed deeply. A new definition of pattern outlier in time series is introduced, based on the pattern representation of time series, use outlier factor series describe the outlier of time series. With the definition, the TOD algorithm for detect pattern outlier in time series is proposed, which gives an outlier factor sequence as results. A pattern is considered as an outlier if its outlier factor exceeds a specified threshold. Compared with other methods, the TOD algorithm need not training dataset. and is no sensitive...
Keywords/Search Tags:Time Series, Pattern Presentation, Similarity Search, Spatial Access Method, Outlier Detection
PDF Full Text Request
Related items