Font Size: a A A

Research On Techniques Of Sampling Large Graphs With Multiple Estimation Goals

Posted on:2021-07-28Degree:DoctorType:Dissertation
Country:ChinaCandidate:L L ZhangFull Text:PDF
GTID:1480306107955419Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the explosive growth of the data in graph based applications,the analysis based on the whole datasets becomes challenging in view of the limited resources of computation and storage.Thus,graph sampling is an important technique to analyze characteristics of large graphs in various applications.Large graphs in real applications can be concluded as two forms.The first is called edge-centric graphs(or graph streams)in which each edge carries the information about interaction between one node(entity)and another node(entity)while the second is called node-centric graphs for which the neighbors of the current web can be easily obtained.Accordingly,there are two types of graph sampling: edge-centric graph sampling and node-centric graph sampling.However,with the increase of the scale of the graphs,there are challenges for the two types of graph sampling techniques,including limited estimation goals,high estimation errors and costs which resulting in unavailability in many applications.To settle these problems,this thesis proposes different sampling techniques to obtain more accurate estimations on various characteristics of large graphs at smaller costs than existing sampling techniques over large graphs.This thesis proposes a new dual sampling method,called triangle-induced reservoir sampling,or T-Sample,to count triangles and estimate node degrees simultaneously and efficiently for edge-centric graphs.The existing reservoir-based sampling methods cause high resource consumption and large estimation errors,making them ineffective.While every edge in a graph stream is still processed only once by T-Sample,a dual sampling mechanism performing both uniform sampling and non-uniform sampling is carefully designed with a base reservoir and an incremental reservoir whose total capacity can be judiciously controlled.Specifically,T-Sample's uniform sampling is used to count triangles by a newly proposed method with smaller estimation variances than existing reservoir-based sampling methods;whereas,its non-uniform sampling ensures that edge samples are connected.Experimental results driven by real datasets show that T-Sample cuts down the estimation error and variance by 50% and 56% on estimations of count triangles than the state-of-the-art reservoir-based sampling methods while obtaining 99% accurate information about node degrees at smaller time and memory costs.This thesis proposes a new random walk based sampling method,named NCRW(node clique random walk)for node-centric graphs.NCRW is to settle the problem of low quality samples and then the unavailability of the query applications caused by the existing random-walk based sampling techniques.The key step of the existing random-walk based sampling methods is to select the next sample from the neighbors of the previously sampled node.Thus,they produce many repetitive and similar samples which leads to a large deviation from the desired node distribution and large estimation errors.Different from the existing random-walk based sampling methods,NCRW traverses from one node clique to another to a large graph.During the sampling process,NCRW employs the strategy of nonbacktracking to eliminate repetitive samples and meanwhile enlarge the sampling space to reduce similar samples.The deviation from the desired node distribution and the estimation errors under the constraint of the sampling budget are reduced both theoretically and experimentally.Our extensive experimental evaluation driven by real-world datasets further confirms that NCRW produces few repetitive samples and cuts down the similar samples by10%,and thus NCRW can increase the estimations on the distributions of degree and cluster coefficient by 10% and 6.8% respectively compared to the existing random-walk based sampling methods.This thesis proposes a new random-walk based method,called two-hop neighbors based random walk or 2-Hopper for node-centric graphs,to estimate both individual and social attributes of social networks with fewer repetitive samples.The existing random-walk based sampling methods mainly focus on either individual or social attributes but not both and they suffer from repetitive samples which have adverse effects on obtaining accurate information about social networks(i.e.,the formation of a special structure).The idea behind2-Hopper is to enlarge the selection space for each sampling step while reducing redundant paths to prevent the sampling process from backtracking to the already sampled nodes.By analyzing the process of 2-Hopper,a re-weighted estimator is proposed to accurately obtain the structural properties of social networks.Thus,2-Hopper can obtain more accurate information on the formations of three special structures of large graphs and meanwhile more accurately estimate the structural properties of both individual and social attributes over large graphs.Experimental results driven by real-world datasets show that 2-Hopper can produce more than 89.9% samples,that is 25% higher than the existing random-walk based sampling methods,to obtain more accurate information on the formations of the structures of large graphs while more accurately estimating the structural properties of both individual and social attributes over large graphs.This thesis proposes a dual random-walk based sampling method,called DRa WS by designing a dual residence of the random walker,to estimate both the distributions of degree and clique size with low costs for node-centric graphs.Existing random-walk based sampling methods cause inaccuracies in estimating the degree structure and high sampling costs in estimating clique structures.The key idea behind DRa WS is that it leverages the many-to-one formation between many nodes and one clique in a large graph to shorten the sampling paths and thus reduce the sampling costs greatly while reflecting the different sampling probabilities of the two types of node structures.Meanwhile,DRa WS employs the one-to-many representativeness between one node and many nodes in a clique to improve the quality of samples.Furthermore,two re-weighted estimators for DRa WS's process are proposed to estimate the two different node structures.Experimental evaluation driven by real graph datasets shows that DRa WS cuts down the sampling costs of the state-of-the-art methods while increasing at most 26% and 50% higher accuracy when estimating both the degree and clique structural properties.
Keywords/Search Tags:Reservoir-based sampling, Random-walk based sampling, Properties of large graphs, Estimator
PDF Full Text Request
Related items