Font Size: a A A

Similarity Query And Pattern Mining On Data Streams

Posted on:2009-09-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:J K GuoFull Text:PDF
GTID:1118360272458838Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
As the domain of data mining is becoming more and more widely,and the content of its is becoming deeper and deeper,people find that many data are produced with the manner of streams,such as Web streams,Web click streams,traffic streams and sensor datas,etc.Studying data stream has become a hot research field in data mining society,and has many important applications.Estimating similarity on data streams and finding pattern hidden in the stream are two important tasks.Much work has been done about how to estimate similarity and patterns on data streams.However,there exists lots of work to be done.We focus on the features of data in the stream context and propose a framework to deal with similarity evaluation.And furthermore we address the problem by analyzing Web data,finding information and mining patterns hidden in them with two algorithms of finding Web access patterns.The main contributions of this thesis are as follows.We propose an efficient technology,i.e.,ETSEDS(an Efficient Technology for Similarity Evaluation on Data Streams),to process similarity queries on data streams under Lp-norm.ETSEDS technology captures the main characteristics of stream data,and exploits a novel tree structure,i.e. SDS-Tree(the Same-Directed Slope Tree),to stratify and construct stream data on sliding windows.Moreover,based on the SDS-Tree structure,we propose two efficient algorithms,i.e.ASQFSW(Algorithm for Similarity Queries in Fixed Sliding Window) and IASQSW(Incremental Algorithm for Similarity Queries in Sliding Windows),to process the similarity queries on single-fixed sliding window and general sliding windows respectively. Specially,IASQSW algorithm can realize similarity queries by updating a spot of nodes on SDS-Tree structure.Furthermore,we present detailed theoretical analyses and extensive experiments that demonstrate our algorithms,which are both efficient and effective.We propose a new algorithm ESDS(Estimating Similarity on Data Streams),which can estimate similarity efficiently on data streams under the time warping distance.In order to evaluate the efficiency of our algorithm,we present a simple but efficient method of Segment to denote the original stream data and the Segment method just need three data to express the stream,which are the max data,min data and the difference between them.To improve the efficiency of Segment,we study a new type of stream which is called Surge Stream.We also propose a new method to judge whether a stream is a Surge Stream.However,Surge Stream sometime will become very regularly in the original stream.So we propose the concept of Valid Surge Stream and Max Valid Surge Scope and design algorithms which can be used to find them respectively.With the help of the above concept,our Segment algorithm can detect the features of the original streams efficiently.In computing the distance of DTW between data streams by using dynamic programming,we also design a new distance of DTW which can compute the similarity on data streams efficiently.The experiments on many real and synthetic data sets show that our algorithm can evaluate the similarity on data streams efficiently and not be studied in the previous research.Discovering interesting Web access patterns from Web logs is a Web usage-mining problem with many practical applications.For this problem, some conventional algorithms,such as GSP,HPrefix and WAP-mine,can be used to mine WAP from Web logs.WAP-mine Algorithm mines Web access patterns by storing the original Web access sequence database and then generates the frequent patterns by recursively mining the intermediate Web access pattern trees.Though WAP-mine scans the database only twice,it needs to recursively build intermediate data and will suffer strongly from such building activities especially on low support thresholds.Because in the process of mining frequent Web access pattern,WAP-Mine will generate much intermediate data,at the lower support the efficiency is very low.To address this issue,we propose two new algorithms based on the top-down manner for mining Web access pattern.First,based on the WAP-tree structure, which is proposed by WAP-mine,we propose a new method TD-WAP-mine which can be used to find Web access patterns in a Top-Down manner. TD-WAP-mine can efficiently find frequent patterns at the low support because it doesn't need generate too much the intermediate WAP-tree. Second,we give a compressed structure of WAP according to the features of WAP and design a new method of how to find the compressed WAP-tree.We also give a new strategy which is called Projection method to find frequent patterns.Instead of stubbornly building intermediate data for each step of mining process,our algorithm selectively builds intermediate data according to the features of current area to be mined.The experimental results on various real world and artificial datasets show that our two algorithms greatly reduce the efforts to build intermediate data and in general offers a better performance than WAP-mine.
Keywords/Search Tags:Data Mining, Data Stream, Similarity Search, Web Access Pattern, Lp-norm, DTW, Frequent Pattern
PDF Full Text Request
Related items