A Study On Parallel Mining Of Continuous Sequential Pattern

Posted on: 2016-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: M J Peng
GTID: 2428330482481286
Subject: Systems analysis and integration
Abstract/Summary:
As society becomes increasingly informatized, information plays an ever more important role in modern life. Continuous sequential pattern algorithms find frequent patterns whose items appear consecutively in the target sequences. Traditional sequential pattern algorithms are used in retail, network communications, finance, and weather analysis. However, they allow the items of a frequent pattern to be non-contiguous, with gaps between them in the source sequence. To address this, the serial Continuous-PrefixSpan algorithm alters the definitions of sequence, prefix, suffix, and projection used in the original PrefixSpan algorithm: a sequence enters the projected database only when the first element of the sequence to be projected equals the last element of the prefix, which ensures that the resulting patterns are continuous. Informatization also produces massive data. Relational databases such as Oracle and SQL Server cope poorly with data sets of terabyte or even petabyte scale. Moreover, sequential mining algorithms must scan the original database several times, which is precisely a weakness of traditional relational databases. For these reasons, this thesis presents a parallel data mining and storage solution based on the Hadoop platform. Hadoop is an open-source platform for developing parallel programs; its Map/Reduce programming model lets multiple computers take part in a computation at the same time, which greatly reduces processing time. Its distributed file system, HDFS, stores data across the storage space of the datanodes in the cluster and replicates it to other datanodes. In this way HDFS addresses the shortage of storage space when handling massive data, and replication also improves data safety. This thesis focuses on improving the PrefixSpan algorithm on the Hadoop platform and designs a Map/Reduce algorithm for parallel continuous sequential pattern mining that uses two rounds of Map/Reduce. The algorithm also ensures that the breadth-first search can proceed in parallel. On this basis, the thesis introduces Hive, a component of the Hadoop platform, to preprocess the data in parallel, so that the entire mining process is parallelized. Successfully porting a traditional serial sequential pattern mining algorithm to the Hadoop platform is significant: Hadoop can take full advantage of the computing power and storage of every datanode, which makes the approach efficient, low-cost, and of high application value.
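The continuity constraint described in the abstract can be illustrated with a small serial sketch. This is not the thesis's Continuous-PrefixSpan or its Map/Reduce design; it is a minimal illustration that assumes each sequence is a list of single items and that support is the number of sequences containing the pattern contiguously. The names `mine_continuous`, `grow`, and `min_support` are hypothetical, and the recursion is depth-first for brevity.

```python
from collections import defaultdict


def mine_continuous(sequences, min_support):
    """Return {pattern_tuple: support} for contiguous frequent patterns.

    Assumption: each sequence is a list of hashable single items, and
    support counts the sequences that contain the pattern contiguously.
    """
    results = {}

    def support(occurrences):
        # Count distinct sequences, not individual occurrences.
        return len({sid for sid, _ in occurrences})

    def grow(prefix, occurrences):
        # occurrences: (sequence id, index of the prefix's last item).
        results[prefix] = support(occurrences)
        extensions = defaultdict(list)
        for sid, pos in occurrences:
            nxt = pos + 1
            # Continuity: only the item immediately after the prefix
            # may extend it, so no gaps are ever allowed.
            if nxt < len(sequences[sid]):
                extensions[sequences[sid][nxt]].append((sid, nxt))
        for item, occ in extensions.items():
            if support(occ) >= min_support:
                grow(prefix + (item,), occ)

    # Length-1 patterns: every position of every sequence, grouped by item.
    initial = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos, item in enumerate(seq):
            initial[item].append((sid, pos))

    for item, occ in initial.items():
        if support(occ) >= min_support:
            grow((item,), occ)
    return results


if __name__ == "__main__":
    data = [list("abcd"), list("xabc"), list("abd")]
    for pattern, count in sorted(mine_continuous(data, 2).items()):
        print("".join(pattern), count)
```

With the sample data, the pattern a, d is not reported even though a precedes d in two sequences, because the two items are never adjacent in both; a gap-tolerant miner would count it as frequent at a minimum support of 2.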
Keywords/Search Tags: Data Mining, Continuous Sequential Pattern, PrefixSpan, Hadoop, HDFS