
A Parallel Algorithm For Correlation Analysis Based On The PAC Model

Posted on: 2018-08-07  Degree: Master  Type: Thesis
Country: China  Candidate: J Ye  Full Text: PDF
GTID: 2358330512987002  Subject: Computer application technology
Abstract/Summary:
With the arrival of the Big Data era and the rapid growth of dataset sizes, parallel/distributed frequent pattern mining holds a clear advantage over traditional techniques, which are limited by the memory capacity and processing power of a single node when handling massive datasets. In this setting, this thesis proposes an effective randomized-sampling-based method to extract approximate frequent itemsets from large-scale datasets. The core idea is to select a random sample of the original dataset and mine approximate frequent itemsets from that sample. Standard algorithms such as Apriori and FP-Growth typically require multiple passes over the whole dataset to find the exact solution; for huge datasets this becomes prohibitively expensive in both main memory and computation.

To mine big datasets efficiently, this thesis presents two new approaches for extracting a high-quality approximation of the frequent itemsets (FIs) from a random sample of a transactional dataset; with high probability, the approximation is a superset of the true FIs. The first approach uses the empirical Rademacher complexity from computational learning theory, combined with a concentration inequality, to obtain an upper bound on the empirical Rademacher complexity of the dataset sample, and then applies a result from statistical learning theory to derive an upper bound on the supremum of the itemsets' absolute frequency error. The second approach directly uses a variant of the classical Bernstein inequality to bound the absolute frequency error of an itemset. Both methods then employ the (ε, δ)-approximation of frequent itemsets to deduce a sufficient sample size for mining approximate frequent itemsets, and the PAC (Probably Approximately Correct) learning framework is applied to analyse the completeness of the sampling. Since the frequent itemset discovery problem was first posed, a large body of literature has addressed sampling-based techniques for obtaining approximate frequent itemsets, and many papers have proposed parallel/distributed frequent itemset discovery algorithms for the big data setting.

The experiments use both the real-world retail dataset and the synthetic T10I4D100K dataset to evaluate the accuracy of the returned collections in terms of recall, precision, and frequency error. Moreover, the running time of our methods is measured in both serial computing (single node) and parallel/distributed computing (a pseudo-distributed Hadoop platform). Finally, we compare the performance of the different theoretical methods and their respective sample-size requirements.

To the best of our knowledge, this thesis is the first to apply the empirical Rademacher complexity and the Bernstein inequality to the sampling-based frequent itemset mining problem and to use results from statistical learning theory to derive a tight sample bound. Compared with the VC-dimension-based method proposed by Riondato and Upfal, our methods require a smaller sample to obtain similar approximation quality. Meanwhile, the sample complexity of our methods is roughly polynomial in the inverse approximation error 1/ε and the inverse confidence parameter 1/δ, which shows that the methods are PAC-learnable. We employ concepts and techniques from computational learning theory to solve a practical problem; this line of research may also be applicable to other areas of data mining.
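The sample-then-mine scheme described above can be sketched as follows. This is a minimal illustration only: the `sample_size` bound below uses a simple Hoeffding inequality with a union bound over the candidate itemsets, standing in for the tighter Rademacher-complexity and Bernstein bounds the thesis actually derives, and all function names are hypothetical. Mining the sample at the lowered threshold θ − ε/2 is what makes the output a superset of the true frequent itemsets with high probability.

```python
import math
import random
from itertools import combinations

def sample_size(eps, delta, d):
    # Hoeffding bound plus a union bound over d candidate itemsets:
    # with probability >= 1 - delta, every candidate's sample frequency
    # deviates from its true frequency by less than eps / 2.
    # (A placeholder for the thesis's tighter Rademacher/Bernstein bounds.)
    return math.ceil((2.0 / eps ** 2) * math.log(2.0 * d / delta))

def approx_frequent_itemsets(dataset, theta, eps, delta, max_len=2):
    # Candidate family: all itemsets over the item universe up to max_len.
    items = sorted({i for t in dataset for i in t})
    d = sum(math.comb(len(items), k) for k in range(1, max_len + 1))
    n = min(sample_size(eps, delta, d), len(dataset))
    sample = random.sample(dataset, n)
    # Count every sub-itemset of each sampled transaction.
    counts = {}
    for t in sample:
        for k in range(1, max_len + 1):
            for s in combinations(sorted(t), k):
                counts[s] = counts.get(s, 0) + 1
    # The lowered threshold theta - eps/2 yields, with high probability,
    # a superset of the true theta-frequent itemsets.
    return {s for s, c in counts.items() if c / n >= theta - eps / 2}
```

In practice the brute-force counting loop would be replaced by Apriori or FP-Growth run over the sample; only the sample-size derivation and the lowered threshold change.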
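The accuracy metrics used in the experiments, recall, precision, and maximum absolute frequency error, can be computed from the exact and sample-based collections roughly as follows (a sketch with hypothetical names, not the thesis's actual evaluation harness):

```python
def evaluate(exact_fis, approx_fis, exact_freq, approx_freq):
    # Recall and precision of the returned collection, plus the largest
    # absolute frequency error over the itemsets found by both runs.
    tp = exact_fis & approx_fis
    recall = len(tp) / len(exact_fis) if exact_fis else 1.0
    precision = len(tp) / len(approx_fis) if approx_fis else 1.0
    max_err = max((abs(exact_freq[s] - approx_freq[s]) for s in tp),
                  default=0.0)
    return recall, precision, max_err
```

When the approximation is a superset of the true FIs, recall is 1.0 by construction and precision measures how many false positives the lowered threshold admits.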
Keywords/Search Tags:association analysis, sampling, computational learning, Map-Reduce