Font Size: a A A

Mining Condensed Sets Of Sequential Patterns And Structured Patterns

Posted on:2007-02-17Degree:DoctorType:Dissertation
Country:ChinaCandidate:T WangFull Text:PDF
GTID:1118360242461940Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In the information industry, the huge amounts of data are collected, and there is the imminent need for turning such data into useful information and knowledge, thus, data mining has attracted a great deal of attention. Data mining helps to discover hidden knowledge from volumes of data. The mining of sequential patterns and structured patterns is an important data mining problem with broad applications.The most important and influential algorithms are GSP and PrefixSpan in sequential pattern mining. These algorithms mine the complete set of sequential patterns, however, it has also been widely recognized that frequent sequential pattern mining often produces a huge number of sequential patterns. An approach to address this problem is that mining a condensed set, instead of mining the complete set. By mining a condensed set, the number of sequential patterns can be reduced dramatically, but the general information of the frequent sequential patterns is essentially preserved. A much smaller set of patterns certainly helps users comprehend the mining results and leads to better efficiency.A condensed frequent sequential pattern base is such a kind of condensed set. It is a special subset of frequent sequential patterns, and is used to approximate the support of frequent sequential patterns not in the base with guaranteed maximal error bound specified by the user. There are two methods to construct a condensed frequent sequential pattern base. The first method constructs a base by examining all frequent sequential patterns level-by-level: a frequent sequential pattern is added into the base only if it cannot be approximated by its sub-pattern in the base. The second method constructs a base by considering only maximal frequent sequential patterns at various layers for a range of support thresholds. In this class of algorithms, the method of how to determine whether a frequent sequential pattern is maximal is proposed and the pruning techniques can prune some unpromising sequential patterns as early as possible to speed the mining process.Compressing the set of frequent sequential patterns is another method in oder to address the problem of explosive number of output sequential patterns. In order to get high-quality compression, it first clusters frequent sequential patterns, and then select and output only a representative sequential pattern for each cluster such that the number of these representative sequential patterns is minimized. A greedy algorithm and an efficient candidate_based algorithm are proposed. The set of representative sequential patterns is also a kind of condensed set. Experimental results show that it can achieve very good compression effect.Mining tree patterns is more difficult than mining sequential patterns because of the combinatorial explosion in frequent subtree mining. A condensed frequent subtree base is consisted of maximal frequent subtrees for a series of support thresholds. It is a subset of frequent subtrees, and is used to approximate the support of arbitrary frequent subtree with guaranteed maximal error bound. In addition, an algorithm is developed to mine such a condensed subtree base in a database of labeled rooted ordered trees. The algorithm is also used for mining a condensed subtree base from a database of labeled rooted unordered trees with only minor changes. It adopts the way of right-most extension to generate systematically frequent subtrees. Several techniques are proposed to prune the branches that do not correspond to maximal frequent subtrees. Heuristic techniques are used to arrange the order of computation so that relatively expensive computation is avoided as much as possible.Frequent patterns in a database can provide information on how to build efficient indexing structures for the databases. A new indexing method, called discriminative, frequent subtree_based indexing, first generates all frequent subtrees, and select discriminative subtrees among them as indexing features, then translates subtrees in the feature set into sequences, and holds them in a prefix tree. Frequent substructure explore the intrinsic characteristics of the data and are relatively stable to database updates. Discriminative, frequent subtree_based indexing can improve dramatically the performance of subtree search.
Keywords/Search Tags:data mining, sequential pattern, frequent subtree, condensed set
PDF Full Text Request
Related items