Font Size: a A A

Primitive Operations in Computing Massive Data Revisite

Posted on:2018-12-08Degree:Ph.DType:Thesis
University:The Chinese University of Hong Kong (Hong Kong)Candidate:Wei, HaoFull Text:PDF
GTID:2448390005951640Subject:Computer Science
Abstract/Summary:
With the massive size of data from various sources, achieving high efficiency for even the primitive operations in big data can be challenging. In this thesis, we investigate the primitive operations in processing two widely used data types, graph and string. Graph can model the online social network, hyperlink graph, and knowledge graph etc., while string is widely used to represent textual data including document,message, and DNA sequence.;We first investigate how to optimize the graph computing in a general way. During our research on various kinds of graph algorithms, we observe that CPU cache performance is the key to the efficiency of graph computing. Motivated by this observation, we analyze the common access pattern of various primitive graph operations, such as Breadth First Search, Depth First Search, PageRank, etc., and formulate the graph ordering problem that aims to utilize the graph data locality to reduce the CPU cache miss ratio based on the common access pattern. We design efficient algorithms to compute the graph ordering and the experiments show that the optimized graph ordering can improve the CPU cache performance for most graph operations. Note that the graph algorithms can get speed up from the graph ordering without the requirement of modifying its implementation and data structures.;We further study the probabilistic data structure and design hash-based labeling schemes for efficiently answering reachability query in graph data and similarity search query in string data. Graph reachability query is a primitive graph operation that checks whether a node can reach another node over the graph. We propose IP+ labeling, a novel randomized labeling approach designed based on min-wise independent permutation technique. The index size and index construction time of the proposed IP+ label are kept to be linear with the graph size. The IP+ label can guarantee that two unreachable nodes can be answered directly by their IP+ labels with a high probability and no graph traversal is required in most cases. Extensive experiments are conducted to confirm its efficiency. In addition, we further extend our IP+ labeling to handle dynamic graph reachability problem and propose label maintenance algorithms to deal with the insertion and deletion of nodes and edges. Another data type investigated is string and we study string similarity search problem, which is to find the similar strings with a small edit distance to the query string from the string dataset. We adopt the existing filter-and-verify framework, where the filter step is introduced to prune many dissimilar strings by the index built offline. Two new hashbased labeling schemes, named OX and XX, are proposed for string similarity search, and the dissimilar strings can be effectively pruned by comparing their hash-labels. Compared with the state-of-the-art approaches, the proposed hash-based labeling approaches can significantly improve the query performance, and at the same time its index size and index construction time can be at least one order of magnitude smaller than the state-of-the-art approaches.
Keywords/Search Tags:Data, Primitive operations, Graph, Size, CPU cache, Index, Computing, String
Related items