Font Size: a A A

Research On Algorithm Of Large Data Set Sequential Pattern Mining

Posted on:2016-12-21Degree:MasterType:Thesis
Country:ChinaCandidate:D LiangFull Text:PDF
GTID:2208330470951332Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Sequential pattern mining is one of the most important research in data mining, which wasfirst proposed in1995by Agrawal and Srikant. Comparing with other data mining algorithms,sequential pattern mining algorithm is mainly based on ordered datasets to mine sequentialpatterns, it has the advantages of high frequency, practicality. Therefore, sequential patternmining has been widespread concerned and in-depth researched by experts and scholars at homeand abroad, its scope of application is also from the initial market basket analysis to DNAsequence analysis, Natural disaster prediction, Disease diagnosis and other fields.So far sequential pattern mining has a lot of classic algorithms. Among those algorithms,PrefixSpan algorithm is one of the most widely used algorithms. The algorithm uses the prefixprojection technology, which could avoid candidate items effectively, thus to improve theefficiency of mining in a certain extent. However, PrefixSpan algorithm needs to structure a lotof projection database, besides, structuring projection database not only need consume hugememory, but increase a lot of scantime. Therefore, this paper improves PrefixSpan algorithm andproposes BLSPM algorithm. This algorithm can greatly reduce the numbers of constructionprojection database, thus improving the efficiency of sequential pattern mining. In addition, thealgorithm proposes the concept of sequential pattern values, and reorders the results of themining sequence patterns by the values of sequence patterns, so that it can find the mostimportant sequence patterns.Then we make experiments to verify the efficiency of BLSPMalgorithm, from different supports, different types of datasets, different sizes of datasets. Inaddition, in order to improve the efficiency of the algorithm in large datasets, this paperimproves the BLSPM algorithm based on Map-Reduce, and selects supermarket products placeas an application example to verify the efficiency and the Practicality.This paper’s main content and innovations are listed below.(1) Improve PrefixSpan algorithm and propose BLSPM algorithm. Firstly, I make aneffective reduction branches. When we build projection database, if the support is less than theminimum support, we delete these infrequent items and remove them from the sequencedatabase, it coule be able to reduce the time of scanning the projection database. Secondly, Ipropose a method of bi-level projection. If sequence length is odd, we constructe the originalprojection database; if sequence length is even, instead of constructing projection database, weconstructure triangle M matrix, so we can greatly reduce the numbers and sizes of the projectiondatabase, it can also reduce the time of scanning projection databases. Finally, I introduce the concept of sequential pattern values, we can reorder the results of sequence which is based on thesize of the sequential pattern values. Thus, it could be possible to find the most importantsequence patterns.(2) Make experiments to verify the efficiency of the BLSPM algorithm.Firstly, comparingof the mining result of two algorithms, we could find that BLSPM algorithm could find the mostimportant sequence patterns, thus to meet the actual demand. Secondly, we make threeexperiments to verify that the BLSPM algorithm is better than PrefixSpan algorithm from thefollowing asperts: different supports, different types of datasets, and different sizes of datasets.(3) Propose the BLSPM of Map-Reduce algorithm. In practical applications, in the face ofhuge datasets, the efficiency of the BLSPM algorithm is facing bottlenecks. Therefore, wepropose BLSPM of Map-Reduce algorithm. By way of distributed processing, we put large tasksinto multiple smaller tasks, then do sequence pattern mining in parallel on each namenode. Thenwe make experiments to verify the efficiency of the algorithm. The first experiment is to verifythe speedup of the algorithm between single platform and Hadoop. The second experiment is toverify the efficiency of the algorithm in different sizes of the datasets. From two experiments, wecan find that this algorithm could be able to improve the efficiency in face of large datasets.(4) Put the algorithm into actual application of the product placement in supermarket. Inorder to verify the practicality and efficiency of the algorithm, we put the BLSPM algorithm intoactual application. By analyzing historical sales data of supermarket, we clean data, sample dataand do a series of operations, then we transform it into sequence database, finally we use thealgorithm to data mining.This experiment has two stages. In the first stage, the sequences ofproduct categories are mined to place the product categories. In the second stage, the productsare mined for each category, we add profit targets for calculating sequential pattern values.Sothat it can reorder the results to find the most profit products. thus, we can adjust productsplacement and improve the profit of supermarket.
Keywords/Search Tags:Sequential pattern mining, Projection database, PrefixSpan, Large datasets, Map-Reduce, Product placement
PDF Full Text Request
Related items