Font Size: a A A

Research On Similarity Search In Large-Scale Graph Databases

Posted on:2020-01-06Degree:DoctorType:Dissertation
Country:ChinaCandidate:X Y ChenFull Text:PDF
GTID:1360330602950304Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Speaking of powerful data structures,graphs can capture the structure of target objects and the attribution of vertices and edges,which have been widely used to model objects in many disciplines,such as bioinformatics,computer vision,software engineering and social networks.Furthermore,with the growing development of information technology,the obtained graph data increases dramatically.Thus,effective analysis and management of graph data become increasingly important.Compared with exact search in graph databases,similarity search can provide a robust solution that permits error-tolerant and supports for searching not precisely defined patterns.Due to the excellent characteristic of the graph edit distance(GED),that is,this measure is applicable to virtually all types of data graphs and can also capture structural differences,this dissertation focuses on the GED-based graph similarity search problem.Specifically,given a query graph Q and a threshold ?,graph similarity search reports all data graphs in the database D that has an edit distance with Q less than or equal to ?.Due to the NP-hardness of GED computation,similarity search algorithms always adopt the “filtering-and-verification” framework.In the filtering phase,GED lower bounds are employed to prune as many false positive graphs from D;this phase can be efficiently accomplished with specified index structures.The remaining unpruned graphs constitute a candidate set and are validated with expensive GED computations in the verification phase.Observing limitations of existing indexing methods(e.g.,loose GED lower bound,large index size and limited scalability)and drawbacks of GED computation methods(e.g.,large search spaces,excessive memory requirements and many expensive backtracking calls),this dissertation proposes new index structures and GED computation methods to improve the performance of graph similarity search.The specific works are summarized below.The first part studies the external index structure for graph similarity search.In view of limited scalability of existing indexing methods when they handle very large graph databases,a novel external index framework for arbitrary q-gram-based representations of a graph is presented to achieve efficient query processing,by converting the q-gram counting filter into a sparse matrix-vector multiplication(Sp MV)problem.Specifically,the q-gram matrix index is proposed and stored in hybrid layout in external memory for efficiently solving the Sp MV.Furthermore,each graph is transformed into a vertex-edge2 D point as well as the global counting filter is converted to a 2D query rectangle to boost the query performance,which allows the similarity search to be performed only in a reduced region,thereby significantly reducing the query I/Os in practice.Extensive experiments on real data sets confirm that the proposed external q-gram matrix index can easily applied to the large dataset of 25 million chemical structure graphs from the Pub Chem dataset and takes much fewer query I/Os than the popular q-gram-based external inverted index.The second part focuses on the succinct index structure for graph similarity search.Considering the limited filtering ability and large index space of existing indexing methods,a succinct q-gram tree index is proposed,which incorporates succinct data structures and hybrid encoding to achieve improved query time performance with minimal space usage.Specifically,the space usage of the proposed index requires only 5%–15% of the previous state-of-the-art indexing size while at the same time achieving several times acceleration in query time on the tested data.In addition,two effective GED lower bounds(i.e.,the degree-based q-gram counting lower bound and degree sequence lower bound)and a novel boosting technique are proposed to obtain the possible smallest candidate set.Extensive experiments on real and synthetic datasets demonstrate that the proposed succinct index is superior both in space and filtering to the state-of-the-art methods.To the best of our knowledge,this index is the first in-memory index that successfully scales to cope with 25 million chemical structure graphs from the Pub Chem dataset.The third part investigates the GED computation method.Existing solutions for computing the GED suffer from several drawbacks: large search spaces,excessive memory requirements,and many expensive backtracking calls.A novel vertex-based mapping method is introduced to calculate the GED in a reduced search space created by identifying invalid and redundant mappings.This method employs the beam-stack search paradigm,a widely utilized search algorithm in AI,combined with two specially designed heuristics to improve the GED computation,achieving a trade-off between memory utilization and expensive backtracking calls.Extensive experiments confirm that the proposed method is highly efficient on both sparse and dense graphs and outperforms the state-of-the-art methods.Furthermore,the proposed method is extended to solve the GED-based graph similarity search problem.The experimental results show that the extended method is dozens of times faster than state-of-the-art graph similarity search methods.The fourth part studies the approximate algorithm for GED computation.The NPhardness of GED computation limits the application of GED on large and complex graphs.An improved anytime algorithm that computes the approximate GED is proposed,which considers the running time as a parameter and outputs a sequence of improved suboptimal solutions of GED to meet different time requirements.This method can fast output an initial suboptimal solution with the proposed neighbor-bias greedy matching method;and it then gradually improves the obtained suboptimal solutions using the tree-based search algorithm that incorporates with a more efficient heuristic.The experiments on real and synthetic datasets confirm that the proposed method achieves much smaller deviations than the state-of-the-art approximate methods under both small and large running time settings.
Keywords/Search Tags:graph edit distance, graph similarity search, external index, succinct index, beam-stack search, anytime algorithm
PDF Full Text Request
Related items