Font Size: a A A

Computing Exact Permutation P-Values For Pattern Discovery

Posted on:2017-09-22Degree:MasterType:Thesis
Country:ChinaCandidate:J WuFull Text:PDF
GTID:2348330488459946Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Data mining aims to extract useful information from a large amount of data. Over the years, pattern discovery is an important problem in the filed of data mining. Every algorithm of pattern discovery can output a lot of patterns. However, a large portion of the patterns reported by these algorithms just satisfy the user-defined constraints purely by accident, rather than reflect characteristics of pupulations. Adopting those patterns to do futher analysis will mislead our decisions.Generally, statistical significance testing is an available and effective technique to filter out those meaningless patterns. In statistical significance testing, permutation testing is a commonly used method. However, this permutation method has three disadvantages. The first drawback is that the original statistic value of a pattern is extreme, its p-value is more likely to be equivalent to 0. The second limitation is that the p-values of a same pattern may be inconsistent in different runs. Third, the computational cost of the permutation method is very high, and to obtain more accurate and stable results, more permutations need to be generated, rendering a more time-consuming task. These disadvantages limit the usability of the permutation method.In this paper, we first provide an analysis of these disadvantages detailedly and propse an idea to compute the exact null distribution of permutation testing. Then, taking association rule mining and discrinative patterns discovery in the field of pattern discovery as two application backgrounds, we design two algorithms called "Exact Permutation p-values for association rules" (EPAR) and ''Exact Permutation p-values for Discriminative Patterns'" (EPDP) for generating the exact null distributions. Both algorithms use the theories of permutation and combination, and can generate the null distributions without producing all permuted data sets. Hence, the exact p-values of tested association rules and discriminative patterns can be calculated from the exact null distributions. Meanwhile, we also proof that the two algorithms are unbiased. We adopt four techniques to improve the efficiency of the two algorithms. To compare EPAR and EPDP with direct permutation testing method and direct permutation-based testing method comprehensively, we use six different type data sets from UCI machine learning repository to conduct our experiments. Experiments on the UCI data sets demonstrate that EPAR and EPDP can successfully alleviate those disadvantages and outperform the direct permutation method and the direct permutation-based method over several performance measures, which shows the feasibility and the advantages of computing exact permutation p-values.
Keywords/Search Tags:Data Mining, Association Rule Mining, Discriminative Pattern Discovery, Permutation Testing, Exact Permutation p-value
PDF Full Text Request
Related items