Font Size: a A A

Research On Graph Data Based Sequential Pattern Mining

Posted on:2021-04-08Degree:DoctorType:Dissertation
Country:ChinaCandidate:M T LeiFull Text:PDF
GTID:1368330605481243Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In these years,with the rapid development of the Internet,massive valuable graph and network data have been produced.The data has strongly correlated relationship.In social networks,these relationships may represent the information transmission between users.In network security,these relationships may represent attacking and defending behaviours.In citation networks,these relationships may represent authors'citation relationships.Such relationships are called by sequential patterns,and mining sequential patterns in large graph data enjoys various applications.On the one hand,it would be interesting to find the same patterns that are shared by a specific dense graph component.These patterns can be used for classification and prediction.On the other hand,it can be used to prune unmeaningful patterns and may benefit of the mining efficiency.To this end,we study the sequential pattern mining problem in large graph data from the angles of detection,refinement and usage of patterns,i.e.,mining top-k sequential patterns in graphs,efficiently approximating top-k sequential patterns in graphs,finding route hotspots in large networks and distributed subgraph matching with sequential patterns.Specifically,we make the following contributions.First,we study the problem of mining top-k sequential patterns from the perspective of the sequential pattern detection.In many real-world networks,a vertex is usually associated with a transaction database that comprehensively describes the behaviour of the vertex,leading to huge computational overheads.To model such type of scenario,we propose the novel notion of the transaction database graph,where each vertex is associated with a transaction database.Every path of the graph is a sequence of vertices that induces multiple sequences of transactions.The sequences of transactions induced by all of the paths in the graph form an extremely large sequence database.Our goal is to find the top-k frequent sequential patterns in the sequence database induced from a transaction database graph.However,it is challenging since the sequence database induced by a transaction database graph is too large to be explicitly induced and stored,and finding the top-k frequent sequential patterns is#P-hard.To tackle this problem,we carefully design a two-step sampling framework that significantly improves the mining efficiency.We first sample a set of transaction sequences from the transaction database graph by a two-step sampling algorithm.Using the sampled transaction sequences,we design an unbiased estimator to approach the frequencies of sequential patterns in the transaction database graph with provable guarantees on the estimation error.Based on the estimator,we introduce the weighting mechanism for the sampled transaction sequences.Extensive experimental results on synthetic and real-world data sets demonstrate the effectiveness and efficiency of our method.Second,we extend the above problem and study the problem of efficiently approximating top-k sequential patterns in graphs.In real-world networks,an edge(or vertex)may be associated with a transaction database,where each transaction consists of a set of items to describe the attributes or behaviors of the edge(or vertex).However,this task is not trivial due to a large number of sequences existed in the graph.In particular,each path of the graph may induce multiple sequences of transactions,and gathering the transaction sequences induced by all the paths may lead to an extremely large volume.To efficiently approximate the top-k patterns,we propose a parallelized sampling-based approach for mining top-k sequential patterns,PSMSP,which involves two key techniques:(a)a parallelized unbiased sequence sampling approach based on a sequence balanced graph partition strategy,and(b)a novel PSP-Tree structure to efficiently mine the patterns based on the anti-monotonicity properties.The experimental results on synthetic and real-world datasets demonstrate that the PSMSP can successfully and the sequential patterns with superior efficiency.Third,we combine the sequential patterns with dense subgraph structures,and study the refinement of sequential pattern mining problem.In this thesis,we investigate the formation of hotspots from an alternative perspective that considers the routes along the network paths as the auxiliary information,and attempt to find the route hotspots in large labeled networks.A route hotspot is a cohesive subgraph that is covered by a set of routes,and these routes correspond to the same sequential pattern consisting of vertices'labels.However,it is challenging as counting the number of hotspots in a network is#P-hard.Inspired by the observation that the sizes of hotspots decrease with the increasing lengths of patterns,we prove several anti-monotonicity properties of hotspots,and then develop a scalable algorithm called FastRH that can use these properties to effectively prune the patterns that cannot form any hotspots.In addition,to avoid the duplicate computation overhead,we judiciously design an effective index structure called RH-Index for storing the hotspot and pattern information collectively,which also enables incremental updating and efficient query processing.Our experimental results on real-world datasets clearly demonstrate the effectiveness and scalability of our proposed methods.Fourth,we study the distributed subgraph matching problem that applies the sequential patterns for the efficiency.To address this issue,we propose GiraphIso,a distributed subgraph matching framework based on Giraph++,a variation of Apache Giraph.Under such framework,the problem is solved in a decomposition-and-combination manner.Both the query graph and the data graph are partitioned to perform distributed graph matching,and the intermediate results from each partition are joined to answer the original queries.On each data graph of each partition,we propose to use the sequential patterns to prune unqualified vertices and edges that can reduce the search space for subgraph matching.Our experimental results on real-world graph data demonstrate the effectiveness and efficiency of the proposed framework and the optimization techniques.
Keywords/Search Tags:graph mining, sequential pattern mining, community detection, subgraph matching, subgraph isomorphism
PDF Full Text Request
Related items