Font Size: a A A

Optimize The FP-tree Based Graph Edge Weight Computation On MapReduce

Posted on:2019-09-15Degree:MasterType:Thesis
Country:ChinaCandidate:M H GuoFull Text:PDF
GTID:2428330566961904Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Weighted graphs have to be constructed before weighted graph based data mining algorithms can be applied,such as clustering analysis,collaborative filtering,etc.However,weighted graphs have to be constructed before such algorithms can be applied.Weighted graphs constructing from a given datasets include node identification,feature extraction and weight measurement.When the extracted features are represented as a set of independent attributes,the weight associated to the edge between two vertices can be measured using the Jaccard similarity between the corresponding two sets.The high computation complexity,large I/O and huge storage consumption of all-pairs set intersection computation makes the edge weight computation a challenging tasks for large-scale graphs.The FP-tree based edge weight computation(EWC for short)with MapReduce has demonstrated its remarkable performance in extracting weighted graphs from large datasets for data analysis.However,we find that the original algorithm performance can be improved.1.The workflow includes unnecessary scan on the datasets.2.The FP-tree involves unnecessary information.3.The Reducers-to-cores mapping strategy of the last MapReduce job is inappropriate which make it exhaust the resources and fail to complete the job.4.The MapReduce cluster nodes have imbalance workload,and The study found that the grouping/partitioning strategy of parallel FP-tree algorithm is a key factor of affecting work load of each node.The problem of data skew and workload imbalance occurs during job execution because of the original algorithm uses a grouping strategy based on the common frequency item.In this survey,we focus on the original FP-tree-based large-scale weighted graph edge weight computation with MapReduce on multi-core Clusters has the problems of uneven load,the workflow of algorithm needs to be optimized,and propose the corresponding optimization scheme:first,we design a more compact FP-tree based EWC with 2-phase MapReduce,reducing one phase scan of the datasets;second,we propose a reduced FP-tree data structure to reduce the FP-tree construction cost;third,we examine two strategies for mapping Reducers to cores for EWC on each multi-core computer: one-Reducer-one-core and one-Reducer-multiple-cores;finally,we propose a grouping strategy based on greedy strategy balanced load to achieve load balancing among cluster computing nodes.In the algorithm evaluation phase of this paper,an empirical comparison performance study has been carried out on the optimized EWC algorithm against the existing one over a massive application datasets generated by a real Facebook social network.The results demonstrate that the optimized FP-tree based EWC algorithm obtains about 39% to 55% percentage improvement in execution time,and in the meantime achieves better scale-out and scale-up speedup,and the grouping strategy based on greedy strategy balanced load can achieve load balancing.The finds of this paper can also be applied to improve the scalability and efficiency of the parallel and distributed execution of applications involving large scale all-pairs set intersection computation over multi-core MapReduce clusters.
Keywords/Search Tags:Weighted Graph Construction, Edge Weight Computation, FP-tree, MapReduce, Multi-core, All-pairs set intersection computation, Load Balance
PDF Full Text Request
Related items