Font Size: a A A

Association Rules Mining And Its Application To Gene Expression Data

Posted on:2008-08-10Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y Q MiaoFull Text:PDF
GTID:1118360212498595Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Association rules mining (ARM) is an important problem in data mining. Since the problem was introduced by A grawal et al. in 1993, it has attracted tremendous interest among researchers and practitioners. With the biology data growth rapidly, bioinformatics posed a great challenge and chance for ARM.This dissertation extensively studied algorithms for ARM, parallel ARM and ARM in gene expression data. The main content, contribution and innovation in the dissertation are described below.(1) The study of algorithms for mining maximal frequent patterns based on FP-treeDue to the search space is exponential of the number of items, efficiently pruning techniques are very important for mining maximal frequent pattern. By investigating the FPmax* algorithm for mining maximal frequent pattern based on FP-tree, we proposed an improved algorithm, FPmax**. FPmax** used the full subset pruning and the first items subset pruning which we presented. An example of FPmax** showed that the two pruning techniques could reduced the computing cost and improved the performance of FPmax*.(2) The study of parallel algorithms for mining frequent closed patterns based on FP-treeWhen a pattern was frequent closed, the pattern was not only frequent but also closed. So, parallel mining for frequent closed patterns was more difficult than frequent patterns. After investigating the parallel algorithms for mining frequent patterns and maximal frequent patterns based on shared-memory and distributed memory systems, we presented two parallel algorithms SL-FP and SP-FP based on shared-memory for mining frequent closed patterns, and two parallel algorithms DL-FP and DP-FP based on distributed memory. The analysis in theory showed that SL-FP and DP-FP spent less I/O, demanded less communication, achieved maximum asynchronous operations and had good load balancing. (3) The study of algorithms for mining frequent closed patterns in gene expression dataset based on hyper-structureFor the sake of avoiding the numerous computing overhead caused by passing the transposed table of dataset many times, we proposed a new algorithm HTclose for mining frequent closed patterns in expression dataset based on hyper-structure. The analysis in theory showed that HTclose greatly enhanced performance against the algorithm passed the transposed table of dataset many times. Several experiments on real-life gene expression datasets showed that HTclose was faster than the latter by up to one order of magnitude.(4) The study of algorithms for mining frequent closed patterns in gene expression dataset based on formal concept analysisTo address the vast computing cost encountered by the bottom-up search strategy for mining frequent closed patterns in gene expression dataset, we proposed two strategies based on the Formal Concept Analysis. One strategy was to convert the search space for frequent closed patterns into the search space for frequent closed tidsets, and another strategy was to search the frequent closed patterns directly by top-down. We designed TPclose and TP+close algorithms based on the two strategies. We proved the two algorithms correct. Several experiments on real-life gene expression datasets showed that TPclose and TP+close received a good performance and scalability, they usually outperformed the algorithms for mining the frequent closed patterns in gene expression dataset based on a bottom-up search strategy by up to two orders of magnitude.
Keywords/Search Tags:association rules mining, gene expression data, maximal frequent pattern, frequent closed pattern, parallel algorithm
PDF Full Text Request
Related items