Research On Parallel Data Mining Algorithms Based On Hadoop

Posted on:2016-03-31

Degree:Master

Type:Thesis

Country:China

Candidate:Chen

Full Text:PDF

GTID:2308330461456049

Subject:Computer technology

Abstract/Summary:

With the rapid development of Internet and computer technology, large amounts of data are being accumulated, and information of every kind is growing rapidly. Finding a feasible and efficient data mining method for data at this scale is a difficult problem. Traditional data mining algorithms can handle small-scale data, but they are not necessarily suitable for large-scale data processing. Parallel data mining algorithms have emerged to meet this need. As an important parallel computing tool, the Hadoop framework has attracted attention from both industry and academia, and it has become an active research topic in data mining.
Apriori is one of the most typical data mining algorithms. Its technical bottleneck in large-scale data mining is that a huge amount of data must be traversed many times, which causes I/O bottlenecks and increases computation time. Many optimized variants of Apriori exist, chiefly parallel algorithms such as CD (count distribution), DD (data distribution), and CaD (candidate distribution); other optimized variants are based on Hadoop. PageRank is a core algorithm of commercial search engines. As the number of web pages soars, it is difficult to avoid the time overhead of iterating over and traversing the page data many times. Scholars have already produced many results on applying PageRank to large data, but directly porting PageRank onto the Hadoop platform does not achieve the best results.
This thesis focuses on porting the Apriori and PageRank algorithms to the Hadoop platform and optimizing them there.
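
As an illustration of why Apriori rescans the data at every level, and of how count distribution (CD) spreads that work, the following is a minimal single-machine sketch in Python. It is not the thesis's implementation: the function names, the in-memory list of partitions standing in for data blocks, and the absolute `min_support` threshold are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def map_count(partition, candidates):
    """Map step: count each candidate itemset within one data partition."""
    counts = Counter()
    for transaction in partition:
        items = set(transaction)
        for cand in candidates:
            if cand <= items:
                counts[cand] += 1
    return counts

def reduce_counts(partial_counts, min_support):
    """Reduce step: sum per-partition counts and keep the frequent itemsets."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return {cand: n for cand, n in total.items() if n >= min_support}

def join_candidates(frequent, k):
    """Join frequent (k-1)-itemsets into k-itemset candidates and prune any
    candidate that has an infrequent (k-1)-subset."""
    prev = set(frequent)
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

def apriori(partitions, min_support):
    """Level-wise Apriori; every level scans all partitions once."""
    items = {item for part in partitions for t in part for item in t}
    candidates = {frozenset([i]) for i in items}
    result, k = {}, 1
    while candidates:
        frequent = reduce_counts(
            [map_count(p, candidates) for p in partitions], min_support)
        result.update(frequent)
        k += 1
        candidates = join_candidates(frequent, k)
    return result
```

Each pass of the `while` loop corresponds to one full traversal of the data, which is the source of the I/O cost described above; in a real Hadoop job the calls to `map_count` would run on separate nodes.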
Building on the MapReduce framework of the Hadoop distributed computing platform, the Apriori implementation uses a parallel join operation, called Data Join, to carry out the computation for the next step at each iteration. The thesis also optimizes the PageRank algorithm: the optimized algorithm takes a site as its input unit, unlike the original algorithm, which takes a single page. The computation also introduces three levels of data compression, reducing both data traffic and storage. The proposed optimized algorithms are tested on different data sets and different distributed clusters and compared with other algorithms. Experiments show that the proposed algorithms improve data adaptability and efficiency and greatly reduce the execution time of data mining, which gives them practical significance.
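
For readers unfamiliar with how PageRank maps onto MapReduce, the sketch below shows one common formulation in Python. It is a generic illustration, not the optimized algorithm of the thesis: the site-level grouping and the three compression levels are not reproduced, and the names `map_contributions` and `pagerank`, the damping factor of 0.85, and the requirement that every link target also appears as a key of `graph` are assumptions of this example. The keys of `graph` could stand for sites rather than single pages.

```python
def map_contributions(rank, out_links):
    """Map step: split a node's rank evenly across its outgoing links."""
    if not out_links:
        return []
    share = rank / len(out_links)
    return [(target, share) for target in out_links]

def pagerank(graph, damping=0.85, iterations=20):
    """Iterative PageRank over an adjacency dict {node: [targets]}."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        incoming = {node: 0.0 for node in graph}
        dangling = 0.0  # rank held by nodes with no outgoing links
        for node, out_links in graph.items():
            if not out_links:
                dangling += ranks[node]
            for target, share in map_contributions(ranks[node], out_links):
                incoming[target] += share  # reduce step: sum by target
        ranks = {
            node: (1 - damping) / n
            + damping * (incoming[node] + dangling / n)
            for node in graph
        }
    return ranks
```

Every iteration reads the whole link structure again, which is the repeated traversal cost the abstract refers to; using a coarser input unit shrinks the structure that each iteration has to process.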

Keywords/Search Tags:

Hadoop, MapReduce, data mining, Apriori, PageRank

Related items

1	MapReduce-based Graph Mining Research
2	Research Of Frequent Itemsets Mining Algorithm Based On MapReduce Calculation Model
3	Research On A Parallel Data Mining Algorithm Apriori
4	Optimization And Implementation Of PageRank Using MapReduce
5	Research Of Data Mining Method For Public Buildings Energy Consumption Based On Hadoop
6	Research On Association Rules Algorithm Based On Hadoop
7	Design And Implementation Of PageRank Computing System Based On MapReduce
8	The Improved Apriori Algorithm Based On Hadoop Calculation Model
9	Research And Improvement Of Apriori Algorithm Based On Hadoop
10	Research And Application Of Improved Apriori Algorithm On Hadoop