Improved Parallel Fp-Growth Algorithm Based On Hadoop

Posted on: 2014-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: S H Zhou
GTID: 2248330398460159
Subject: Computer application technology
Abstract/Summary:
Frequent pattern mining is an important data mining technique. It is widely applied to transactional databases, time series databases, and many other types of databases. The serial Frequent-pattern Growth (Fp-Growth for short) algorithm runs into bottlenecks in both storage and computation when the dataset being mined is large enough, so in this situation it is necessary to parallelize it. The existing parallel Fp-Growth algorithm has solved the problem of how to partition the dataset so that the partitions are independent of one another, but it does not take load balancing into account when partitioning the dataset. Achieving a load-balanced parallel Fp-Growth algorithm is therefore the main problem addressed in this paper.

Hadoop is an open-source distributed parallel programming framework under the Apache foundation that allows a computer cluster to process large datasets in a distributed manner using a simple programming model. Hadoop handles parallel computing problems such as job scheduling, distributed storage, fault tolerance, and network communication, so developers need only focus on the algorithm itself. Therefore, this paper adopts the Hadoop framework to implement the parallel Fp-Growth algorithm.

Two pieces of work are accomplished in this paper. The first is an improvement of the existing parallel Fp-Growth algorithm. The second is the application of the improved algorithm to mining frequent user access paths. First, building on a study of the Hadoop-based parallel Fp-Growth algorithm, this paper improves its grouping strategy by estimating the load of each frequent item.
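The abstract does not say how the load of each frequent item is estimated or how items are assigned to groups. As a minimal sketch of the general idea only, the following assumes that an estimated load per item is already available (the numbers below are made up) and balances groups with a simple greedy rule: take items in decreasing order of load and always add the next item to the currently lightest group. The function name `balanced_groups` and the greedy rule are illustrative assumptions, not the thesis's method.

```python
import heapq


def balanced_groups(item_loads, num_groups):
    """Greedily assign items to groups so estimated loads are roughly even.

    item_loads: dict mapping a frequent item to its estimated load
                (hypothetical values; the estimator itself is not shown).
    Returns (groups, totals), where groups[g] lists the items in group g
    and totals[g] is the summed estimated load of that group.
    """
    # Min-heap of (current total load, group id): the lightest group is on top.
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]

    # Heaviest items first; ties broken by item name for determinism.
    ordered = sorted(item_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for item, load in ordered:
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load, g))

    totals = [sum(item_loads[i] for i in grp) for grp in groups]
    return groups, totals


if __name__ == "__main__":
    # Made-up load estimates for six frequent items.
    loads = {"a": 9, "b": 7, "c": 6, "d": 4, "e": 3, "f": 1}
    groups, totals = balanced_groups(loads, 2)
    print(groups)
    print(totals)
```

In a MapReduce setting, each resulting group would correspond to the set of items whose conditional patterns are mined by one reducer, so more even totals suggest more even reducer run times, provided the load estimates are reasonable.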
Tests show that the parallel Fp-Growth algorithm proposed in this paper outperforms the existing parallel Fp-Growth algorithm, with better load balancing and execution efficiency. Second, a massive amount of user access information is stored in Web server logs, so it is feasible to discover hidden and valuable user behavior information in them. The paper therefore applies the proposed parallel Fp-Growth algorithm to Web log mining to extract frequent user access path patterns. The results may provide guidance and reference for the website that produced the logs, which gives the work practical and commercial value.
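Before Fp-Growth can be run on Web logs, the log records have to be turned into transactions. The abstract does not describe this preprocessing, so the sketch below is only one plausible approach under stated assumptions: each record is a hypothetical `(user, unix_time, url)` tuple, a user's visits are split into sessions whenever the gap between two requests exceeds a timeout (30 minutes here, an arbitrary choice), and each session becomes one transaction of distinct pages in first-visit order.

```python
from collections import defaultdict


def sessions_to_transactions(records, timeout=1800):
    """Turn simplified log records into one transaction per user session.

    records: iterable of (user, unix_time, url) tuples (assumed format).
    timeout: maximum gap in seconds between two requests of one session.
    Returns a list of transactions, each a list of distinct URLs.
    """
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    transactions = []
    for user in sorted(by_user):
        current, last_ts = [], None
        for ts, url in sorted(by_user[user]):
            # A long pause starts a new session for this user.
            if last_ts is not None and ts - last_ts > timeout:
                transactions.append(current)
                current = []
            # Keep each page once per session, since a transaction is an itemset.
            if url not in current:
                current.append(url)
            last_ts = ts
        if current:
            transactions.append(current)
    return transactions


if __name__ == "__main__":
    # Made-up records for two users.
    log = [
        ("u1", 0, "/"), ("u1", 60, "/a"), ("u1", 5000, "/b"),
        ("u2", 10, "/"), ("u2", 20, "/"),
    ]
    for t in sessions_to_transactions(log):
        print(t)
```

Plain Fp-Growth treats each transaction as an unordered set, so how page order within a path is preserved would depend on how the items are encoded, which the abstract leaves unspecified.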
Keywords/Search Tags: Parallel Fp-Growth, Hadoop, MapReduce, Web log mining