Font Size: a A A

Knowledge Discovery In Time Series

Posted on:2010-09-24Degree:DoctorType:Dissertation
Country:ChinaCandidate:L WanFull Text:PDF
GTID:1118360308962199Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid development of database, internet and telecom technique, time series data appears widely in real world life, such as telecommunication business, financial market, manufacture process, science experiment, medical care, meteorology, bio-information etc. and the scale of data is huge and exploding. How to help users to discover hidden pattern, information and knowledge from large time series data in order to support decision is a raring demand. It's also the core research topic in time series data mining. Recently, researchers just start to concern time series data mining, many topics are challenging, many algorithms need to be extended and perfected.This paper reviewed time series data mining from prediction, classification, clustering, search and frequent pattern discovery five aspects. The main methods in each area are summarized and reviewed, then we focus on the research topics in frequent pattern discovery and community identification in dynamic complex networks deeply. Finally based on the conclusion of this paper, we proposed some research directions in future. The contributions of this paper are as follows:(1) Proposed a frequent pattern discovery algorithm FPM (Frequent Pattern Mining), this algorithm considered not only the supports but also the distributions of frequent patterns. It can discover the "exceptions" frequently appears in a special time segment and the sequential event frequently appears in the whole time series. Based on these frequent patterns with different distributions, we extend MAMC (Mixed memory Aggregation Markov Chain) model to FMAMC (Frequent pattern based Mixed memory Aggregation Markov Chain) model. FMAMC described the temporal correlations among different types of frequent patterns in time series. Experiments demonstrate FPM performs better than PrefixSpan (Prefix-projected Sequential pattern mining) and WinMiner algorithms, FMAMC model can predict time series more accurately than MAMC model.(2) Proposed a frequent pattern based time series classification framework, it included feature extraction, feature selection and classification three phases:the proposed MNOE algorithm extracts non-overlap frequent patterns from time series and proposed EGMAMC (Episode Generated Mixed memory Aggregation Markov Chain) model based on these non-overlap episode to describe time series. After that, according to ratio likelihood test, we induced the support that can guarantee the significance of non-overlap episode. Finally, according to the definition of entropy, we selected the significant non-overlap episode whose entropy of distribution exceed a specified threshold to train classification model. Experiments demonstrate that select frequent patterns as features can improve classification models effectively.(3) Proposed the concept of feature flow to describe the distributions of frequent patterns in time series, we categorized the frequent pattern into three types according to their spectrum of feature flows and map different types of frequent patterns to different types of event hidden in time series. We proposed EDPA (Event Detection by Pattern Aggregation) algorithm to aggregate tightly correlated frequent patterns, each cluster of frequent patterns represents an event. Experiments demonstrate that inputting maximal significant non-overlap frequent patterns into EDPA algorithm to detect event can get more accurate result than selecting other types of frequent patterns.(4) Proposed a family of static community identification algorithms based on maximal cliques and a dynamic community identification algorithm:Firstly, this paper proposed an algorithm CLIM (CLIque Mining) to discover maximal cliques in complex networks. We designed the prune strategy of CLIM based on the observation that the clustering coefficient of most of the vertices in complex network is large. Experiments demonstrate CLIM performs better than Improved BK(Bron-Kerboscht) algorithm on complex network and random graphs with high clustering coefficient. Based on the maximal cliques found by CLIM, this paper defined a community consists of community core and peripheral vertices and proposed a community identification algorithm CDPM (Clique Directed Percolation Method) based on overlapping maximal cliques. CDPM utilized the proposed SSC (Structure Silhouette Coefficient) as objective function to identify communities in complex networks. The value of SSC is larger the corresponding community identification is better. Experiments demonstrate that CDPM algorithm performs better than CPM (Clique Percolation Method), GN (Given-Newman) algorithms evaluated by F-measure and VAD (Vertex Average Degree). By considering link structures and attributes attached on vertices in complex networks, this paper proposed a JCCM (Joint Clustering Coefficient Method) algorithm to identify communities in informative graph. Firstly, it utilized heuristic method to calculate community cores formed by overlapping maximal cliques. Then JCC is used as the objective function to aggregate community cores and peripheral vertices together, different distance functions are used to measure the distance between vertices in different positions. Experiments demonstrate JCCM performs better than existing algorithms which do not consider structure information and attributes attached on vertices cooperatively. Based on the static community identification algorithms, by defining evolving relationships among static communities at adjacent time points, this paper proposed a dynamic community identification algorithm DCI (Dynamic Community Identification). DCI utilized a MDL (Minimum Description Length) based method to identify communities in static complex networks. Then, we abstract the communities in adjacent complex networks and their overlapping relationship as bi-graphs and reduce the evolutions of communities in adjacent static complex networks into a bi-graph partition problem. A MDL based method is also proposed to partition bi-graphs. Experiments demonstrate that DCI algorithm can identify communities more accurately and run faster than GS (GraphScope), GC (GroupColoring) algorithms.(5) Abstract the process of information transition in a social network as a semi-Markov process model in which vertices appear following some transition probability. We proposed "idle state" to extend multi-dimensional Markov process, and proposed social network information flow model based on it. Each state in state space of information flow model represents an actor in a social network, transition probability in information flow model is the probability that two actors exchange information. This transition probability is determined by the characteristic of actors and the sub-structures of social network actors involved in. We proposed reward function based on social network structure and construct reward Markov model to calculate customer value. Based on information flow model and reward Markov model, we proposed a collaborative filtering algorithm SMRR (Semi-Markov and Reward Renewal) by considering personalized behavior patterns of actors and the influence of interacting social actors together. Experiments demonstrate SMRR algorithm performs better than traditional cosine algorithm and RIF (Rate Information Flow) algorithm evaluated by accuracy.
Keywords/Search Tags:Time series, Data mining, Frequent pattern discovery, Complex network, Community Identification
PDF Full Text Request
Related items