
Study And Implementation On Techniques Of Parallel Mining Of Frequent Closed Sequences Based On Vertical Segmentation

Posted on: 2016-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: T C Bi
GTID: 2428330542957391
Subject: Computer application technology
Abstract/Summary:
Sequential pattern mining is an important part of data mining. Since Agrawal and Srikant introduced the concept of sequential patterns, more and more researchers have taken up the subject. Sequence mining is widely applied, for example in market analysis, fraud detection, scientific exploration, and product control, and as data mining develops it will play a large role in more fields. With the development of Web 2.0, the information explosion has intensified, which poses a serious challenge to sequential pattern mining: when the data are too large to fit on a single computer, how can sequential patterns be mined? Some parallel algorithms generate candidate sequential patterns and others do not, but both kinds rely on physical memory. Once the original data no longer fit in memory, the algorithm cannot run at all. The contributions of this thesis are as follows:
(1) To the best of our knowledge, this thesis is the first to propose the concept of vertical segmentation for sequence mining. The time complexity of the algorithm is related to the number of columns. We first intersect the sequences pairwise, which shortens them. The original sequences are then composed of many shorter patterns, from which we select K mutually distinct sequences.
(2) To mine on a smaller dataset, most sequential pattern mining algorithms compress the original data when the data are huge. In this thesis we instead present the concept of pattern compression. Compressing patterns has several benefits: it reduces the scale of enumeration, shortens mining time, and lowers time complexity.
(3) Because big data cannot fit in memory, we improve the algorithms that rely on physical memory. In each MapReduce job we mine only sequential patterns of a fixed length. Although this is less efficient than the earlier algorithm, it solves the problem of the dataset not fitting in memory.
(4) Our algorithm is built on Hadoop, a parallel framework. We first distribute the data to the different nodes of the cluster, then rewrite the single-machine algorithm to suit the map and reduce model. Since the candidates are independent of each other, our algorithm achieves high speed-ups.
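The idea in contribution (3), one job per pattern length, can be illustrated with a minimal in-memory sketch. This is not the thesis's algorithm: it simulates the map and reduce steps in plain Python rather than on Hadoop, treats every sequence as a tuple of single items, mines frequent (not closed) sequences, and omits vertical segmentation and pattern compression. The function names `map_phase`, `reduce_phase`, and `mine` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from itertools import combinations


def map_phase(sequence, k, prev_frequent):
    """Emit (pattern, 1) once per distinct length-k subsequence of one sequence."""
    seen = set()
    # combinations() keeps the original item order, so each tuple is a subsequence.
    for candidate in combinations(sequence, k):
        # Apriori-style pruning: the length-(k-1) prefix must already be frequent.
        if k > 1 and candidate[:-1] not in prev_frequent:
            continue
        seen.add(candidate)
    return [(pattern, 1) for pattern in seen]


def reduce_phase(pairs, min_support):
    """Sum the counts per pattern and keep those that reach min_support."""
    counts = Counter()
    for pattern, one in pairs:
        counts[pattern] += one
    return {p: c for p, c in counts.items() if c >= min_support}


def mine(sequences, min_support, max_length):
    """Run one simulated job per pattern length (assumed scheme, for illustration)."""
    all_frequent = {}
    prev = set()
    for k in range(1, max_length + 1):
        # Each sequence is mapped independently, which is what allows parallelism.
        pairs = [pair for seq in sequences for pair in map_phase(seq, k, prev)]
        level = reduce_phase(pairs, min_support)
        if not level:
            break
        all_frequent.update(level)
        prev = set(level)
    return all_frequent


if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "d")]
    for pattern, support in sorted(mine(db, min_support=2, max_length=3).items()):
        print(pattern, support)
```

Only the frequent patterns of the previous length are carried between rounds, so no round needs the patterns of all lengths in memory at once. The cost is one full pass over the data per length, which is consistent with the trade-off the abstract describes.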
Keywords/Search Tags:Data Mining, Pattern Mining, Pattern Compression, Parallel Mining, MapReduce