The Research And Implementation Of Some Data Mining Algorithms On Cloud Database

Posted on: 2014-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: L Guo
Full Text: PDF
GTID: 2248330395997859
Subject: Computer software and theory
Abstract/Summary:
From early supercomputers to multi-core CPUs, computing technology has advanced, and cloud computing is now relatively mature: it increases computing power while lowering cost. Many data mining algorithms struggle to handle big data, yet extracting knowledge from such data is indispensable. Combining data mining algorithms with cloud technology is therefore the future trend. Unfortunately, research results for some of the harder problems are scarce, and not every algorithm can execute on a cloud platform.

The core idea of cloud technology is MapReduce, on which every strategy for adapting a data mining algorithm must be based. The MapReduce model has two parts [9]: MAP and REDUCE. The map function receives input pairs and produces a set of intermediate key/value pairs, which are then sent to the reduce nodes. The reduce function accepts an intermediate key and a set of values for that key, and merges these values to form a possibly smaller set of values. An algorithm must be separable into these two functions; otherwise it is difficult or impossible to run on the platform. In this paper, we analyze the feasibility of each algorithm before proposing its improvement.

We propose improved methods for three data mining algorithms. The first is Apriori, which has difficulty finding frequent structures in big data sets. Existing improvements can reduce the number of scans over the original data set, but almost all of them do so by sacrificing disk space, which is suboptimal for mining big data. We propose an improved Apriori algorithm; however, because of limits imposed by the MapReduce framework, it outputs frequent structures of every length and cannot prune those that are not the longest.

This paper also describes how to fix the maximum support when the data source is dispersed.
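The map and reduce roles described above can be illustrated with a minimal, single-process sketch of one Apriori-style counting pass. This is not the improved algorithm proposed in the thesis; the function names, the in-memory `shuffle` step, and the sample transactions are all hypothetical, and a real platform would distribute these steps across nodes.

```python
from collections import defaultdict
from itertools import combinations


def map_transaction(transaction, k):
    """Map step: emit (itemset, 1) for every k-item subset of one transaction."""
    for itemset in combinations(sorted(set(transaction)), k):
        yield itemset, 1


def shuffle(pairs):
    """Group intermediate values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(itemset, values):
    """Reduce step: merge all values for one key into a single support count."""
    return itemset, sum(values)


def frequent_itemsets(transactions, k, min_support):
    """Run map, shuffle, and reduce, then keep itemsets meeting min_support."""
    pairs = (pair for t in transactions for pair in map_transaction(t, k))
    counts = dict(reduce_counts(key, vals) for key, vals in shuffle(pairs).items())
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical sample data for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "beer", "milk"],
    ["beer", "milk"],
    ["bread", "milk", "eggs"],
]
result = frequent_itemsets(transactions, k=2, min_support=2)
print(result)
```

With this sample data, only the pairs (bread, milk) and (beer, milk) reach the support threshold of 2. The final filter is applied after all counts have been merged, so the threshold is checked against global counts, not per-partition counts.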
Because of the limitations of the distributed structure, mining frequent data from decentralized data may yield only partially correct results. To ensure the accuracy of the results, we add one node that processes the results sent from REDUCE, so that all frequent results can be mined even when the REDUCE output contains impure data.

Iceberg-CUBE uses Apriori to prune cubes that do not meet the iceberg condition. Since that pruning is already fully solved, the remaining problem is how to divide the original data. The MAP nodes divide the data in a manner similar to BUC, applying MAP one or more times. Because each division loop requires more MAP nodes, the number of MAP nodes must be balanced against the number of REDUCE nodes.

The shell of the cube is a fraction of the multidimensional database, designed according to the associations and meaning of the data, the kinds of queries that are issued frequently, and the actual situation. Building a shell is a good solution for problems such as big data sets and slow, long-running queries. The traditional shell algorithm is suitable for MapReduce, but it needs certain changes before it can run on the platform.
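The BUC-style division and iceberg pruning mentioned above can be sketched as follows. This is a single-process illustration of the general bottom-up cube technique, not the thesis's MapReduce implementation; the names `buc` and `iceberg_cube`, the row format, and the sample data are assumptions. In a MapReduce setting, the partitioning loop is the part that would plausibly be assigned to MAP nodes.

```python
from collections import defaultdict


def buc(rows, dims, min_count, start, cell, out):
    """Record the count for `cell`, then partition `rows` on each remaining dimension."""
    out[cell] = len(rows)
    for d in range(start, len(dims)):
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[dims[d]]].append(row)
        for value, part in partitions.items():
            # Apriori-style pruning: a partition below the threshold cannot
            # contain any finer cell that meets the iceberg condition.
            if len(part) >= min_count:
                child = cell[:d] + (value,) + cell[d + 1:]
                buc(part, dims, min_count, d + 1, child, out)
    return out


def iceberg_cube(rows, dims, min_count):
    """Return {cell: count} for every cell with at least min_count rows.

    A cell is a tuple with one entry per dimension; None means "all values".
    """
    if len(rows) < min_count:
        return {}
    return buc(rows, dims, min_count, 0, (None,) * len(dims), {})


# Hypothetical sample data for illustration only.
rows = [
    {"city": "A", "item": "x"},
    {"city": "A", "item": "x"},
    {"city": "A", "item": "y"},
    {"city": "B", "item": "x"},
]
cube = iceberg_cube(rows, dims=["city", "item"], min_count=2)
print(cube)
```

In this sketch, the partition for city B holds a single row, so it is discarded before any of its finer cells are examined. Pruning early in this way is what keeps the number of partitions, and therefore the number of division tasks, from growing with every dimension.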
Keywords/Search Tags: Cloud computing, Apriori, Iceberg-CUBE, Shell fragments