Research On Parallel Data Mining Algorithms Based On Hadoop

Posted on: 2016-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: Chen
GTID: 2308330461456049
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet and computer technology, large amounts of data are being accumulated, and information of every kind is growing rapidly. Finding a feasible and efficient way to mine such vast amounts of data is a difficult problem. Traditional data mining algorithms can handle small-scale data, but they are not necessarily suitable for large-scale data processing. Parallel data mining algorithms have emerged in response to this need. As an important parallel computing tool, the Hadoop framework has attracted the attention of both industry and academia, and it has become an active research topic in data mining.
Apriori is one of the most typical data mining algorithms. Its bottleneck in large-scale data mining is that the full data set has to be traversed many times, which causes I/O bottlenecks and increases computation time. Many optimizations of Apriori have been proposed, mainly parallel algorithms such as CD (count distribution), DD (data distribution), and CaD (candidate distribution), as well as optimized Apriori variants based on Hadoop. PageRank is the core algorithm of commercial search engines. As the number of web pages soars, it is difficult to avoid the time overhead of its many iterations and repeated traversals of the page data. Scholars have already produced many results on applying PageRank to large data, but porting PageRank directly onto the Hadoop platform does not achieve the best results.
This thesis focuses on porting the Apriori and PageRank algorithms to the Hadoop platform and optimizing them there.
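To make the repeated-traversal bottleneck and the count distribution (CD) idea concrete, here is a minimal single-process sketch in Python. It only simulates the map and reduce phases with ordinary functions; it is not the thesis's implementation and does not use Hadoop. The names `map_count`, `reduce_counts`, `join_candidates`, and `apriori` are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def map_count(partition, candidates):
    """Map phase: count each candidate itemset inside one data partition."""
    local = Counter()
    for transaction in partition:
        items = set(transaction)
        for cand in candidates:
            if cand <= items:  # candidate is contained in the transaction
                local[cand] += 1
    return local


def reduce_counts(partials, min_support):
    """Reduce phase: sum the partial counts and keep the frequent itemsets."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return {c: n for c, n in total.items() if n >= min_support}


def join_candidates(frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets, pruning any candidate
    that has an infrequent (k-1)-subset."""
    prev = set(frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(partitions, min_support):
    """Level-wise Apriori; every level costs one full pass over all partitions."""
    candidates = {frozenset([i]) for p in partitions for t in p for i in t}
    result, k = {}, 1
    while candidates:
        partials = [map_count(p, candidates) for p in partitions]
        frequent = reduce_counts(partials, min_support)
        result.update(frequent)
        k += 1
        candidates = join_candidates(frequent, k)
    return result
```

Each iteration of the `while` loop rescans every partition, which is the cost the abstract describes: the number of passes grows with the size of the largest frequent itemset.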
Combined with the MapReduce framework of the Hadoop distributed computing platform, the Apriori algorithm uses a parallel join operation, called Data Join, to carry out the next computation at each iteration. The thesis also optimizes the PageRank algorithm: unlike the original algorithm, which takes a single page as its input unit, the optimized algorithm takes a site as input. The computation also introduces three levels of data compression, reducing the amount of data transferred and stored. The proposed algorithms are tested on different data sets and different distributed clusters and compared with other algorithms. The experiments show that the proposed algorithms improve data adaptability and efficiency and greatly reduce the execution time of data mining, which gives them some practical significance.
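The abstract does not say how pages are grouped into sites or how the three compression levels work, so the following Python sketch rests on assumptions: it treats the URL host name as the site, drops intra-site links, and runs a standard damped PageRank iteration written as an emit (map) step followed by a combine (reduce) step. The compression step is not shown. The function names and the damping value 0.85 are illustrative choices, not details taken from the thesis.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Assumption: a 'site' is identified by the host part of the URL."""
    return urlparse(url).netloc


def build_site_graph(page_links):
    """Collapse page-level links (src, dst) into a site-level graph."""
    graph = {}
    for src, dst in page_links:
        s, d = site_of(src), site_of(dst)
        graph.setdefault(s, set())
        graph.setdefault(d, set())
        if s != d:  # links inside one site carry no rank between sites
            graph[s].add(d)
    return graph


def pagerank_step(graph, ranks, damping=0.85):
    """One iteration: each site emits rank shares, then shares are summed."""
    n = len(graph)
    contributions = defaultdict(float)
    dangling = 0.0
    # Map-like phase: a site splits its rank evenly over its out-links.
    for site, outs in graph.items():
        if outs:
            share = ranks[site] / len(outs)
            for dst in outs:
                contributions[dst] += share
        else:
            dangling += ranks[site]  # rank of sites without out-links
    # Reduce-like phase: combine shares and spread dangling rank evenly.
    return {
        site: (1 - damping) / n + damping * (contributions[site] + dangling / n)
        for site in graph
    }


def pagerank(graph, iterations=30, damping=0.85):
    """Run a fixed number of iterations from a uniform starting vector."""
    n = len(graph)
    ranks = {site: 1.0 / n for site in graph}
    for _ in range(iterations):
        ranks = pagerank_step(graph, ranks, damping)
    return ranks
```

Working at site level shrinks the graph that every iteration has to traverse, which is one plausible reading of why site-level input reduces data traffic; the thesis itself should be consulted for the actual design.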
Keywords/Search Tags:Hadoop, MapReduce, data mining, Apriori, PageRank