Improved Parallel Fp-Growth Algorithm Based On Hadoop

Posted on:2014-01-26

Degree:Master

Type:Thesis

Country:China

Candidate:S H Zhou

Full Text:PDF

GTID:2248330398460159

Subject:Computer application technology

Abstract/Summary:

Frequent pattern mining is an important class of data mining algorithms. It is widely applied to mining transactional databases, time series databases, and many other types of databases. The serial Frequent-pattern Growth (Fp-Growth for short) algorithm runs into bottlenecks in both storage and computation when the dataset to be mined is large enough, so in this situation it is necessary to parallelize Fp-Growth. The existing parallel Fp-Growth algorithm has solved the problem of how to partition the dataset so that the partitions are independent of one another, but it does not take load balancing into account when partitioning the dataset. Achieving a load-balanced parallel Fp-Growth algorithm is therefore the main problem addressed in this paper.

Hadoop is an open-source distributed parallel programming framework under the Apache foundation that allows a computer cluster to process large datasets in a distributed manner using a simple programming model. Hadoop handles parallel computing concerns such as job scheduling, distributed storage, fault tolerance, and network communication. Developers only need to focus on the algorithm itself, while Hadoop takes care of scheduling and other system-level problems. The paper therefore adopts the Hadoop framework to implement the parallel Fp-Growth algorithm.

Two pieces of work are accomplished in this paper. The first is an improvement of the existing parallel Fp-Growth algorithm. The second is the application of the improved algorithm to mining frequent user access paths. First, building on a study of the Hadoop-based parallel Fp-Growth algorithm, this paper improves its grouping strategy by estimating the load of each frequent item.
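The load-aware grouping idea can be illustrated with a minimal sketch. The abstract does not give the load-estimation formula or the assignment rule, so everything named here is an assumption: `estimate_loads` uses a hypothetical stand-in (support multiplied by the item's rank in the frequency-ordered list), and `balanced_groups` uses a standard greedy heuristic that places the heaviest remaining item into the currently lightest group. It is not the thesis's actual method.

```python
import heapq
from typing import Dict, List


def estimate_loads(supports: Dict[str, int]) -> Dict[str, float]:
    """Hypothetical load estimate (an assumption, not the thesis's formula).

    Items are ranked by descending support. An item at rank r can have up to
    r more-frequent items in its prefix paths, so support * (r + 1) is used
    here as a rough stand-in for the cost of mining that item.
    """
    f_list = sorted(supports, key=lambda item: (-supports[item], item))
    return {item: float(supports[item] * (rank + 1))
            for rank, item in enumerate(f_list)}


def balanced_groups(loads: Dict[str, float], num_groups: int) -> List[List[str]]:
    """Greedy grouping: heaviest item first, always into the lightest group."""
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    # Min-heap of (accumulated load, group id); the lightest group is on top.
    heap = [(0.0, gid) for gid in range(num_groups)]
    heapq.heapify(heap)
    for item in sorted(loads, key=lambda i: (-loads[i], i)):
        total, gid = heapq.heappop(heap)
        groups[gid].append(item)
        heapq.heappush(heap, (total + loads[item], gid))
    return groups


def group_totals(loads: Dict[str, float], groups: List[List[str]]) -> List[float]:
    """Total estimated load per group, useful for checking the balance."""
    return [sum(loads[item] for item in group) for group in groups]
```

With example loads `{"a": 5, "b": 4, "c": 3, "d": 3, "e": 1}` and two groups, the sketch assigns `a, d` to one group and `b, c, e` to the other, so both carry an estimated load of 8. In a MapReduce setting each group would then be mined by one reducer, so more even group totals mean more even reducer run times.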
Tests show that the parallel Fp-Growth method proposed in the paper outperforms the existing parallel Fp-Growth method, with better load balancing and higher execution efficiency. Second, Web server logs store massive amounts of user access information, so it is feasible to discover hidden and valuable user behavior information in them. The paper therefore applies the proposed parallel Fp-Growth method to Web log mining in order to extract frequent user access path patterns. The results of this application may provide guidance and reference for the website that produced the logs, which gives the work practical and commercial value.
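Before access paths can be mined, raw log records have to be turned into per-visit page sequences. The sketch below shows one common way to do that preprocessing step. The record layout (user id, timestamp in seconds, URL) and the 30-minute inactivity timeout are assumptions made for illustration; the abstract does not describe how the thesis prepares its logs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: (user id, timestamp in seconds, requested URL).
Record = Tuple[str, float, str]


def sessionize(records: Iterable[Record], timeout: float = 1800.0) -> List[List[str]]:
    """Split each user's requests into sessions (ordered URL paths).

    A new session starts whenever the gap between two consecutive requests
    of the same user exceeds `timeout` seconds (hypothetical default: 30 min).
    """
    by_user: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions: List[List[str]] = []
    for user in sorted(by_user):
        current: List[str] = []
        last_ts = None
        for ts, url in sorted(by_user[user]):  # chronological order
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions
```

Each returned session is an ordered list of URLs that could serve as one transaction for a frequent pattern miner.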

Keywords/Search Tags:

Parallel Fp-Growth, Hadoop, MapReduce, Web log mining

Related items

1	Research On Association Rules Mining Methods Of Mass Engineering Data Based On Hadoop
2	Research On Parallel FP-growth Association Rules
3	Research On The Vertical FP-growth Mining Algorithm Based On Hadoop With Load Balancing
4	Research And Design Of Data Mining System For Tcm Disease Based On Cloud Computing Environment
5	Research On Association Rules Algorithm Based On Hadoop
6	Parallel Data Mining Algorithms Research Of Hadoop
7	Hadoop-based Parallel Algorithm For Mining
8	Research On Parallel Acceleration Algorithm Of Association Rules Based On Hadoop
9	Research On Parallel Mining Algorithm Of Association Pattern Based On Spark
10	The Research And Implement Of Data Mining Algorithms Based On Hadoop