
A Schema Feature Based Frequent Pattern Mining Algorithm For Semi-structured Data Stream

Posted on: 2018-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: W Q Fu
Full Text: PDF
GTID: 2348330563452503
Subject: Master of Engineering / Software Engineering
Abstract/Summary:
With the development of information technology, massive amounts of data are constantly being generated, and analyzing this data is no longer a task that can be completed manually. To solve this problem, data mining technology was proposed to discover useful information in massive data. Frequent pattern mining is an important task in data mining. A frequent pattern is a data fragment that occurs repeatedly in the data, and frequent pattern mining is the task of finding these patterns in massive data. Research on frequent pattern mining for semi-structured data has made some progress, and frequent pattern mining for data streams has also received considerable attention. However, only a few studies address data that is both semi-structured and streaming. Therefore, how to mine frequent patterns from semi-structured data streams efficiently and accurately is the focus of this paper. A semi-structured data stream is real-time, ordered, infinite, and continuous, and it also has a tree structure. This paper proposes a mining model based on time windows for mining semi-structured data streams. The model first serializes and segments the semi-structured data stream, then mines each segment with the SPrefixTreeISpan algorithm proposed in this paper. Finally, all mining results are maintained in a structure called patternTree. To solve the problem of incorrect mining results caused by segmentation, this paper proposes a structure called checkStack and an accompanying mining strategy. This paper uses XML data streams as the mining object. Since there is usually a Schema document that describes the structure of the XML data, analyzing the Schema makes it possible to extract the inevitable parent-child relationships and the inevitable child-parent relationships, which are then used to optimize the SPrefixTreeISpan algorithm. Experiments show that the algorithm has better performance and that the optimization strategy based on Schema features is effective.
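The segment-then-mine-then-merge flow described in the abstract can be illustrated with a minimal Python sketch. It is not the SPrefixTreeISpan algorithm, and it does not reproduce patternTree or checkStack, whose details the abstract does not give. As a stand-in, it assumes that each stream item is a complete XML document with a timestamp, treats a single parent-child tag pair as the only kind of pattern, and uses a plain counter in place of patternTree. The function names, the window size, and the support threshold are all hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def edge_patterns(xml_text):
    """Return the set of (parent_tag, child_tag) edges in one XML document."""
    root = ET.fromstring(xml_text)
    edges = set()
    for parent in root.iter():
        for child in parent:
            edges.add((parent.tag, child.tag))
    return edges


def segment_by_window(stream, window_size):
    """Group (timestamp, xml_text) items into consecutive fixed-size time windows.

    Assumes timestamps are non-decreasing; empty windows are skipped.
    """
    segment, window_end = [], None
    for timestamp, xml_text in stream:
        if window_end is None:
            window_end = timestamp + window_size
        while timestamp >= window_end:
            if segment:
                yield segment
                segment = []
            window_end += window_size
        segment.append(xml_text)
    if segment:
        yield segment


def mine_stream(stream, window_size, min_support):
    """Mine each window separately, then merge the per-window counts.

    `totals` is a simplified stand-in for the patternTree structure:
    it only accumulates how many documents contain each edge pattern.
    """
    totals = Counter()
    for segment in segment_by_window(stream, window_size):
        local = Counter()
        for xml_text in segment:
            # A set is passed in, so each pattern counts once per document.
            local.update(edge_patterns(xml_text))
        totals.update(local)
    return {p: c for p, c in totals.items() if c >= min_support}
```

Because every item here is a whole document, no tree is ever cut by a window boundary, so the incorrect-mining problem that checkStack addresses does not arise in this simplified setting.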
Keywords/Search Tags:Frequent Pattern Mining, Semi-Structured Data Stream, Schema Feature