Font Size: a A A

Lossless Compression Of Large Scale Graphs With Query Support

Posted on:2018-03-30Degree:DoctorType:Dissertation
Country:ChinaCandidate:L ZhangFull Text:PDF
GTID:1318330512985357Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Graph is a data model that can represent relationships among entities.Compared with structured and non-structured data,the graph data model is more diverse,more flexible and easier to implement.With the development of the Internet and various data collection and transmission techniques,more data are modeled in the form of graphs to represent connections among different nodes,users and entities in networks.At the same time,the sizes of graphs that need to be managed in search engines,social network services,science data analysis,knowledge network analysis and other applications are growing.Traditional graph compressing algorithms often need to store the entire dataset in memory.Therefore,it is an important problem in data management to reduce the storage cost of graph data.In graph data compression methods,both the compression ratio and the processing efficiency are necessary factors to be considered.On the other hand,when an application processes compressed graph data,it is also crucial to avoid unnecessary decompression operations to support various type of graph queries in order to improve the data processing efficiency.Aiming at solving the problems stated above,this thesis studies lossless compression methods of large-scale graph data that support graph queries.In particular,contributions of this thesis include the three following points:· This thesis presents a lossless compression method based on triangle listing,named bound-tri.This method utilizes a large number of triangles in the graph,and merges the nodes in these triangles,to produce a lossless compression result.The bound-tri algorithm filters out high degree nodes from the triangle listing,which reduces graph compression time.Meanwhile,bound-tri ensures that there are only a few redundant edges in the result,which leads to a low compression ratio,bound-tri also allows users to set different filter parameters for compression to balance the compression ratio,compression time and the processing time of graph queries over the compressed data.The bound-tri method introduced in this thesis can performlossless compression efficiently over large-scale graphs,especially social networks.Comparing with most existing lossless graph compression methods,bound-tri has a higher compression ratio and shorter compression time.· In this thesis,a lossless compression method named STT is proposed to com-press streaming graphs.STT performs lossless compression using the nodes that appear multiple times in the streams of the past and the edges connecting these nodes.STT tries to store hot nodes as well as the edges or triangles containing the nodes in a fixed-size streaming cache.Nodes and edges in the cache are grouped into triangles by the triangle listing algorithm to compress streaming graphs.This tech-nique achieves a lower compression ratio with high streaming throughput.Mean-while,there are a certain number of triangles in the compression result to meet the demand of fast query processing without decompressing the graph.STT does not re-quire the input of the complete graph topology or the full graph during compression.In addition,users can change the cache update efficiency by adjusting parameters,and compress streaming graphs with different rates,or increase the throughput of graph compression at the expense of compression ratio.Compared with the current static graph compression methods,the STT algorithm can obtain a similar compres-sion ratio to that of RVE,Re-Pair and VNM.The compression operation can be started when the streaming is gradually sent to the system,improving the compres-sion efficiency greatly.· This thesis proposes query processing algorithms on answering three graph queries,i.e.,adjacent nodes query,k-distance nodes query and common neigh-bor nodes query.These algorithms ensure that query results can be returned with-out decompressing the graph that has been compressed by triangle listing.The graph compressed by bound-tri and STT contains a large number of triangle subgraphs,which can be extracted easily.Using the information of nodes and edges stored in triangle subgraphs,the algorithms can answer graph queries in a short time,avoiding the expensive decompression operations.Based on these algorithms,it can handle graph data analytic tasks such as sampling social network data,user recommen?dation,information spread analysis and others.In contrast,graphs compressed by other methods,such as SLAHSHBURN,BV schema and ?2-partitioned,needs to be decompressed partially or completely before query processing algorithms can answer these queries.To sum up,this thesis studies the problem of lossless compression for large-scale static graphs and streaming graphs.It achieves efficient graph compression by node merg-ing operations and large number of triangle subgraphs in the original graph.In static graphs,it filters out high-degree nodes and run triangle listing on the remaining nodes during compression processing,achieving a higher compression ratio and taking less time.In streaming graphs,hot nodes are maintained in a streaming cache.The cache stores asmany triangles as it can in streaming graphs.Therefore,the graph can be compressed without storing the entire graph locally.Additionally,several query processing methods on compressed graphs are studied and be prove that graphs compressed by bound-tri and STT can be processed in a short time without decompression.Compression experiments on multiple real-world large-scale datasets and compari-son with other graph compression methods show that bbound-tri and STT have three ad-vantages:high compression ratio,good compression performance and query supporting.
Keywords/Search Tags:Graph, Streaming Graph, Lossless Compression, Graph Query, Trian-gle Listing
PDF Full Text Request
Related items