The Research Of Cloud Frequent Itemsets Mining Algorithms Which Based On Sample

Posted on:2014-09-20

Degree:Master

Type:Thesis

Country:China

Candidate:W Wan

Full Text:PDF

GTID:2268330401488953

Subject:Computer application technology

Abstract/Summary:

With the development of data collection technology, the era of massive data has arrived. In today's fiercely competitive business environment, people are eager to extract useful information from massive data to help them make sound business decisions. However, traditional data analysis and data mining techniques struggle to meet this demand because of their excessive time and space costs. For example, traditional frequent itemset mining must scan the data set many times, which is time-consuming, and must store a large number of candidate itemsets, which consumes a large amount of memory. At the same time, cloud computing, which offers high concurrency and low-cost processing of massive data, is developing rapidly. In recent years the Hadoop ecosystem has been its most representative development. Hadoop consists mainly of two parts: HDFS and MapReduce. It uses inexpensive commodity machines as compute nodes to form a cloud platform that can process massive data efficiently. Combining data mining with cloud computing means applying the strengths of cloud computing, such as efficient processing of massive data, to large-scale data mining, which brings new vitality to traditional data mining technology. This thesis aims to combine frequent itemset mining with cloud computing. The main work is as follows:

(1) First, this thesis studies and analyzes the Hadoop platform in depth. Hadoop has two core parts: the HDFS distributed file system for massive data storage and the MapReduce parallel programming framework for data processing. The two parts complement each other and together constitute the Hadoop distributed framework.

(2) To further improve the efficiency of frequent itemset mining, this thesis proposes a parallel sampling algorithm based on Hadoop. Using the MapReduce programming framework, the algorithm obtains a random sample by scanning the massive data only once. Data clean-up can also be performed during the same sampling pass.

(3) After an in-depth study of traditional frequent itemset mining algorithms, this thesis proposes a sample-based cloud frequent itemset mining algorithm. The algorithm runs on the Hadoop platform to take full advantage of cloud computing for processing massive data. Experimental results show that the algorithm has good mining performance.
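The abstract does not say which sampling scheme or mining procedure the thesis uses, so the following is only a minimal single-process sketch of the general idea, not the thesis's algorithm. It assumes Bernoulli sampling (each record kept independently with a fixed probability) combined with clean-up in the same pass, followed by level-wise (Apriori-style) mining on the sample. The names sample_and_clean and frequent_itemsets, the whitespace-separated record format, and the parameters rate, min_support and max_size are all hypothetical. In a Hadoop job, the sampling function would roughly correspond to map-side logic.

```python
import random
from collections import Counter
from itertools import combinations


def sample_and_clean(lines, rate, seed=0):
    """Single pass over the records: skip empty ones (clean-up) and keep
    each remaining record with probability `rate` (Bernoulli sampling,
    an assumption of this sketch)."""
    rng = random.Random(seed)
    for line in lines:
        items = frozenset(line.split())
        if not items:
            continue  # clean-up: drop empty or whitespace-only records
        if rng.random() < rate:
            yield items


def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise (Apriori-style) mining on the sample.
    `min_support` is a fraction of the sampled transactions."""
    transactions = list(transactions)
    threshold = min_support * len(transactions)
    frequent = {}
    # Level 1: count single items.
    counts = Counter(frozenset([i]) for t in transactions for i in t)
    current = {s: c for s, c in counts.items() if c >= threshold}
    k = 1
    while current:
        frequent.update(current)
        if k == max_size:
            break
        k += 1
        items = sorted({i for s in current for i in s})
        # Keep a k-candidate only if all its (k-1)-subsets are frequent.
        candidates = [
            frozenset(c)
            for c in combinations(items, k)
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        ]
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= threshold}
    return frequent


if __name__ == "__main__":
    records = ["a b c", "a b", "", "a c", "b c", "a b c"]
    sample = list(sample_and_clean(records, rate=0.5, seed=42))
    for itemset, count in frequent_itemsets(sample, min_support=0.5).items():
        print(sorted(itemset), count)
```

Because the sample is random, itemsets whose true support is close to the threshold may be missed or wrongly reported. A common mitigation, not described in the abstract, is to mine the sample with a slightly lower threshold and then verify the candidates against the full data.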

Keywords/Search Tags:

data mining, frequent itemsets, Hadoop, MapReduce

Related items

1	Research Of Frequent Itemsets Mining Algorithm Based On MapReduce Calculation Model
2	Study On Parallel Mining Frequent Itemsets Over Uncertain Database Based On Hadoop
3	Research On Key Algorithms For Mining Frequent Patterns In Data Streams And Their Application In Simulation System
4	Research On Key Algorithms For Mining Frequent Patterns In Data Streams And Their Application
5	The Research And Implementation Of Mining Frequent Itemsets Algorithm Over Streaming Data
6	Research On Algorithm For Mining Frequent Itemsets Of Uncertain Data
7	FP-Tree Based Mining Frequent Itemsets Over Data Streams
8	Research On Algorithms For Mining Maximal Frequent Itemsets
9	Frequent Itemsets Mining Algorithm And Its Application In Data Flow
10	Research On Frequent Closed Itemsets Mining Algorithms