Font Size: a A A

Distributed Embedded Tree Pattern Mining On MapReduce

Posted on:2019-08-09Degree:MasterType:Thesis
Country:ChinaCandidate:W ZhaoFull Text:PDF
GTID:2428330545999748Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Finding embedded patterns from tree-structured data is a significant research issue with many practical applications,such as e-commerce,biology,user behavior prediction,etc.Over the years,there are many algorithms proposed to solve this problem.However,these algorithms Almost all focus on digging on the stand-alone model.Given the explosion of data scale,these solutions are inefficient or lack of feasibility.In this thesis,according to the different types of input data,the mining of frequent unordered embedded tree patterns in distributed environments based on MapReduce is studied from two directions:transactional data and single tree data.In this thesis,we propose a distributed frequent unordered embedded tree pattern mining algorithm TETPM for transactional data.The TETPM algorithm is divided into two phases:the preparation phase finds all frequent labels;each time of iteration of the iteration phase generates size k + 1 pattern candidates from the frequent size k frequent patterns in a two-two-mode pattern join manner and maintain a list of instances of each pattern in the process,then calculate the support of each pattern candidate through the join of the instance lists of the two patterns.Based on TETPM algorithm,we propose two algorithms with different load balancing strategy based on data granularity:TETPM-P partitions work load by pattern and TETPM-E partitions work load by pattern instance.We compare the two algorithms in the later experimental part.Experimental peformance results show that both can handle the data scale that can not be solved by the stand-alone state-of-art algorithm,and TETPM-E is more suitable for larger datasets,while TETPM-P is more suitable for datasets with more equal number of instances of frequent patterns.In this thesis,we propose a distributed frequent unordered embedded tree pattern mining algorithm EtpmLtd for single tree data.The EtpmLtd algorithm runs in an iterative manner.Each iteration is divided into three phases:Candidates enumeration phase enumerates size k +1 candidate patterns from the known size k frequent patterns in a two-two-mode pattern join manner;the local mining phase computes the instances of each candidate pattern in each data block and ensures that no instances are missed by external descendants expansion strategy;the global summary phase computes the global support according to all instances.We propose two optimization stategies based on the EtpmLtd algorithm to improve the running efficiency and reduce the communication consumption.The results in the later experiments show that the EtpmLtd algorithm can handle the data scale that can not be solved by the stand-alone state-of-art algorithm,and the optimizations can bring obvious performance improvement.
Keywords/Search Tags:MapReduce, Embedded tree pattern, Workload balancing, Transactional data, Single tree data
PDF Full Text Request
Related items