Font Size: a A A

New algorithms for frequent sequential pattern and itemset data mining in certain and uncertain databases

Posted on:2013-10-28Degree:Ph.DType:Dissertation
University:University of Arkansas at Little RockCandidate:Peterson, Erich AllenFull Text:PDF
GTID:1458390008469868Subject:Computer Science
In this dissertation, new algorithms and concepts in the areas of sequential pattern and itemset mining—in both certain and uncertain databases—are disseminated. These algorithms include: 1) An improved algorithm for mining frequent patterns, which uses a pattern-growth enumeration technique and new data structure called a first-occurrence forest (FOF). The new data structure is composed of a simple linked-list of pointers, in which each pointer in the list point to the first-occurrences of an item within an aggregate tree (a lossless / compressed version of the pattern database); 2) The definition of the new concept and algorithm for the mining of relaxed closed subspace clusters (RCSCs). RCSCs are mined by transforming the problem of mining for subspace clusters into the problem of mining for frequent itemsets. After the transformation of the problem, a new quality measurement is defined—known as the diameter of the subspace cluster. By using the diameter, the concept of a closed subspace cluster is defined. Lastly, to combat the problem of a possible low number of unique diameters—and thus a high number of unique subspace clusters—a relaxation factor is introduced to partition the user-defined minimum closeness threshold and create a finite number of intervals to which each subspace cluster must belong; 3) The definition of the new concept and algorithm for probabilistic frequent closed itemsets (PFCIs) in uncertain databases called PFCIM for probabilistic frequent closed itemset mining. To this end, the definition of probabilistic support is introduced to facilitate the mining of a lossless and compact representation of all probabilistic frequent itemsets (PFIs) from an uncertain database; 4) A new algorithm for the approximation of PFCIs, named A-PFCIM for approximate probabilistic frequent closed itemset mining. Because the probability of the support of an itemset occurring within an uncertain database can be modeled using the Poisson binomial distribution, and the fact that the Poisson binomial distribution can be accurately approximated using the Poisson distribution, the formulation of an algorithm that approximates PFCIs using the Poisson distribution is possible and is formulated; 5) An algorithm for the mining of approximate probabilistic frequent itemsets (A-PFIs) from incremental or evolutionary uncertain databases called (IA-PFIM). In that research, the properties of the expected support of an itemset in an uncertain database, along with the properties of the Poisson distribution, are exploited to define lemmas that allow one to maintain the set of PFIs in an evolving database; finally, 6) The definition of the new concept and algorithm for the mining of probabilistic generalized frequent itemsets (PGFIs) in uncertain databases called PGFIM for probabilistic generalized frequent itemset mining. In that research, a taxonomy is introduced (in the form of a directed acyclical graph)—relating the items that are within the uncertain database, and those which are generalized (or abstract items) which do not appear in the database. Given this taxonomy, there exist generalized items which do not appear in the uncertain database, but nevertheless, can be probabilistically frequent within. Thus, a new method is introduced which calculates the probability of a generalized item appearing within a transaction, and thus, one can then mine for PGFIs in an uncertain database.
Keywords/Search Tags:Uncertain, Mining, New, Algorithm, Itemset, Frequent, Pattern, Using the poisson
Related items