
Research On Parallel FP-growth Association Rules

Posted on: 2017-02-24 | Degree: Master | Type: Thesis
Country: China | Candidate: S Q Lou | Full Text: PDF
GTID: 2308330485484777 | Subject: Software engineering
Abstract/Summary:
Association rule mining is a basic and important model in data mining, and the frequent pattern growth algorithm (FP-growth) is one of its classical algorithms. However, as the volume of data to be processed increases, FP-growth becomes inefficient, or cannot even build the global FP-tree in memory. Parallel FP-growth algorithms have therefore been proposed. Optimizations of parallel FP-growth have also been studied, but most of them focus on balancing the load of each child node and do not take the inter-node communication strategy into account.

Association rules have been widely used in finance. Financial risk is directly proportional to the size of an enterprise, and financial risk analysis is an important step in enterprise financial risk management. Existing methods of financial risk analysis approach the problem from either a qualitative or a quantitative point of view, and both have shortcomings. In particular, the Apriori algorithm adopted in quantitative analysis may not cope well with today's huge volume of financial information. To solve these problems, the main work of this thesis is as follows:

(1) Mapping data by the F-list causes load imbalance and excessive communication cost. To solve these problems, this thesis proposes two improved algorithms. The first is a node load optimization algorithm based on a greedy strategy (GFP). It applies a greedy algorithm to the level mapping used in parallel FP-growth: frequent 1-itemsets are grouped into a table by making locally optimal choices that aim at a globally optimal result, so that each child node receives an approximately equal computational load. The second is a node traffic control model based on grouping.
Although the GFP algorithm can balance node load, when the conditional pattern base of a frequent item is mapped to the group with the maximum load, a large amount of data may have to be transmitted, which increases traffic between nodes. We therefore propose an FP-growth algorithm based on traffic optimization (TFP). When dividing frequent items into groups, TFP gives priority to the most frequent items on the same child node and assigns them to the same group. This keeps the load of each child node balanced while keeping the traffic between nodes small, which gives better traffic control.

(2) To address the shortcomings of quantitative analysis in financial business, the improved parallel FP-growth association rule algorithm, TFP, is substituted for the Apriori algorithm. The TFP algorithm can handle large data sets, and its time and space complexity are improved compared with the previous algorithm. Finally, an enterprise financial risk analysis system based on parallel FP-growth is designed. The implemented system has four layers: the communication layer, the business layer, the data processing layer, and the distributed storage and computing layer. Built on the Hadoop platform, the system can mine massive financial data and obtain the association rules between financial risk metrics.
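The greedy grouping idea behind GFP in item (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `greedy_group` and `group_loads`, the per-item load values, and the rule "place the heaviest remaining item in the currently lightest group" are assumptions chosen for illustration, since the abstract does not define how load is estimated or how ties are broken.

```python
import heapq
from typing import Dict, List


def greedy_group(item_loads: Dict[str, float], num_groups: int) -> List[List[str]]:
    """Assign frequent 1-itemsets to groups, one locally optimal choice at a time.

    item_loads maps each frequent item to an estimated mining cost.
    The estimate itself is hypothetical; the abstract does not specify it.
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    # Min-heap of (accumulated load, group index): the lightest group pops first.
    heap = [(0.0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    # Visit items from heaviest to lightest; break ties by name for determinism.
    for item, load in sorted(item_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load, g))
    return groups


def group_loads(groups: List[List[str]], item_loads: Dict[str, float]) -> List[float]:
    """Return the summed estimated load of each group."""
    return [sum(item_loads[item] for item in group) for group in groups]


if __name__ == "__main__":
    # Made-up load estimates for five frequent items, split across two nodes.
    loads = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
    groups = greedy_group(loads, 2)
    print(groups, group_loads(groups, loads))
```

This sketch balances load only. A traffic-aware variant in the spirit of TFP would additionally prefer to keep the most frequent items held by the same child node in the same group, which this code does not attempt.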
Keywords/Search Tags: FP-growth, Parallel Algorithm, Hadoop, Load Balance, Risk Analysis