Research On Large-scale Graph Analysis Algorithms For Constrained Storage

Posted on:2023-09-28

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X L Wan

Full Text:PDF

GTID:1528306839981579

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

As graph models are widely used in various of applications,the size of graph data grows tremendously.The scale of current graph data can easily surpass GB and reach TB,EB level.Using internal graph analysis algorithms to process large-scale graph data has a high cost of memory space.Correspondingly,using external graph analysis algorithms needs to ensure that they are operable for computing devices with limited memory,which can hardly meet the efficiency requirements for large-scale graph analysis.In the meantime,the continuous increase of the scale of graph data also challenges the indexbased graph analysis algorithms.For example,certain subgraph query algorithms need to maintain indexes in the main memory,whose sizes are basically equal to the size of input graph.The sizes of the external indexes of certain reachability query algorithms grow exponentially,with respect to the scale of the input graph.This thesis discusses the research difficulties of traditional graph analysis problems,including depth-first search,strongly connected component computation,subgraph query,and reachability query,within constrained storage.Since both subgraph query and reachability query are often related to special graph models,and knowledge graph is a typical graph model,this thesis selects knowledge graph model as their specific application scenario.Specifically,the main research contents and research results of this paper are listed as follows.First,this thesis proposes an efficient depth-first search algorithm,named EP-DFS,for large-scale graphs with constrained memory.Different from traditional external environment,semi-external environment sets a lower limit for the size of the memory space of the computing device.Specifically,current semi-external depth-first search algorithms maintain a spanning tree for in the main memory,and continuously restructure until is a depth-first search tree of .It is analyzed in this thesis that since the depth-first orders of the nodes on are unpredictable during the process of restructuring ,traditional algorithms need to scan many times,involve complex CPU calculations,or use numerous random disk accesses.To address this problem,EP-DFS adopts a special method to restructure ,and can fix the depth-first orders of a large number of nodes on in its earlier stage,with less I/Os,simpler CPU calculations and less random I/O accesses.Experimental results on both synthetic and real datasets confirm that the CPU cost and I/O cost of EP-DFS are both an order of magnitude lower than traditional algorithms.Second,this thesis also proposes an efficient semi-external strongly connected component computation algorithm,named EP-SCC,for large-scale graphs with constrained memory.It is analyzed in this thesis that in order to reduce the memory space usage,traditional semi-external algorithms for strongly connected component computation only maintain two attributes for each node of G in the main memory.It causes that traditional algorithms have high CPU and I/O consumption,since they cannot compute the reachability relationship for any two nodes of in (1)time and have to use some complex optimizations in their query process.Thus,for EP-SCC,this thesis proposes a novel inmemory procedure,named EP-Reduction.Based on this procedure,EP-SCC can compute the reachability relationship between any two nodes of in (1)time without optimization,when only maintaining two node attributes for each node of in the main memory.To evaluate the performance of EP-SCC and traditional algorithms,this thesis conducts experiments on both synthetic and real datasets in which dataset WDC-2014 contains more than 1.7 billion nodes,and dataset eu-2015 has over 91 billion edges.Experimental results confirm that EP-SCC significantly outperforms traditional algorithms.Third,this thesis proposes an algorithm,named LKAQ,for answering representative subgraph queries on large-scale knowledge graphs with memory constraint.It is analyzed in this thesis that with the increase of the scale of graph data,the indexes designed in traditional algorithms for knowledge graph can hardly be maintained in the main memory.In addition,existing subgraph query algorithms on knowledge graphs still aim to find the full result set of the input query.However,the full result set of a subgraph query on may contain numerous subgraphs of ,which poses challenges to the follow-up work of analyzing .Thus,some applications prefer a representative subset of rather than the whole .Therefore,to efficiently process subgraph queries on with memory constraint,and return a representative result set for the input query,this thesis designs two light-weight indexes on knowledge graphs,i.e.I-Index and M-Index.Based on these two indexes,LKAQ can achieve an automatic balance between efficiency,memory cost and the size of the representative result set,during the query process.The experimental results on both real and synthetic datasets confirm that,LKAQ outperforms traditional subgraph query algorithms on knowledge graphs,and the performance of LKAQ is still good when the available memory space is significantly smaller than the size of .Finally,this thesis proposes an algorithm,named INSK,for answering label-and substructure-constraint reachability(LSCR)queries on knowledge graphs with limited external space.For one thing,with the growth of the scale of current graph data,the sizes of the external indexes of traditional algorithms for reachability queries are huge,which can hardly be afforded by current external storage devices.For another,reasoning on largescale graphs is no longer limited to traditional unconstrained reachability queries or labelconstraint reachability(LCR)queries,and often related to LSCR queries.It is analyzed in this thesis that existing solutions for LSCR queries on knowledge graphs rely on online search algorithms,and have a fixed search direction,so that they cannot be performed efficiently.Therefore,this thesis designs a light-weight index,named L-Index,for INSK.Based on L-Index,this thesis develops a complex heuristic evaluation function.Thus,INSK can break the fixed search direction of traditional uninformed search algorithms,and efficiently process the input query.The theoretical analysis and experimental results of this thesis confirm that the external space cost of L-Index is much smaller than that of traditional LCR algorithm.In the meantime,the experimental results also confirm that the query cost of INSK on both real and synthetic datasets is an order of magnitude lower than that of the optimized algorithms of the existing solutions for LSCR queries on knowledge graphs.

Keywords/Search Tags:

Constrained storage, Depth-first search, Strongly connected component, Subgraph query, Reachability query

PDF Full Text Request

Related items

1	Studies On Reachability Query Technologies For Large Graph Data
2	Research On Subgraph Search Technology In Graph Database
3	Research Of Query Technologies On Graph Data
4	Research On Exact Subgraph Search Technology In Graph Database
5	Research On Algorithms For Subgraph Query Based On Graph Databases
6	Research On Frequent Subgraph-based Graph Query Techniques
7	Research On Techniques For Query Processing On Graph Databases
8	Research On Algorithms For Subgraph Query Based On Single Neighborhood
9	Research Of Subgraph Isomorphism Search Algorithm On Graph Data
10	Research On Query Processing Method For Large Scale Graph