| Graph sampling,as a key technique for simplifying largescale networks,extracts a proportion of nodes and edges from the original graph to form a representative small-sized sample graph.The sample graph can be used to estimate topological attribute of the original graph,accelerate graph computation,and improve the readability of graph visualization.Community structure of the original graph is supposed to be retained in graph sampling since community structure is the core topological feature of a graph.However,graph sampling can be affected by community imbalanceness.That is,when a graph contains multiple imbalance-sized communities,graph sampling usually perform undesirably in preserving community structure in the original graph,which leads to the incompleteness of sample graph in terms of community structure.To verify this empirical hypothesis and address the underperformance of graph sampling on community-imbalance graphs,this thesis conducts three aspects of research as follows:(1)A metric is proposed to measure the degree of community imbalanceness quantatively.First,the community partition of a graph is obtained by a modularity-based approach.Then,a pair of vectors are constructed to represent the community size distributions of the original graph and its corresponding community-balanced benchmark graph respectively.Finally,the distance divergence is involved to measure the difference between the pair of vectors.The further the distance,the higher the degree of community imbalanceness of the graph.Multiple sets of comparative experiments are conducted on simulated graph dataset with controlable imbalance communities to verify the effectiveness of the proposed measurement metric.(2)An experimental study is conducted to evaluate the impact of community imbalanceness on graph sampling.First,this thesis comprehensively selects and implements 16 graph sampling algorithms,4sampling evaluative indicators,and real-world network datasets.Then,three empirical hypotheses are proposed.H1: Sampling quality on community-imbalance graphs is worse than that of community-balanced graphs.H2: the higher the degree of community imbalance of a graph,the worse the sampling quality on it.H3: the above effect exists for all graph sampling algorithms and under all sampling rates.The experiment results show that H1 and H2 are confirmed while H3 is partially confirmed.Community imbalanceness exerts significant impact on single-seed traversal-based algorithms with a sampling rate not higher than 0.3.For random and multi-seeds sampling algorithms with any sampling rate higher than 0.3,there is no significant impact.(3)A new sampling method is proposed oriented to unbalanced graphs to improve the undesirable underperformance of existing algorithms.In our newly-proposed method:(a)Centers of all communities are identified as the sampling seeds.(b)Asynchronous multi-way random-walk is invoked to sample nodes,which allocates the access probability of each node based on bridging centrality,to enable a biased random-walker and lead across-over-communities traversal paths to break through bottlenecks between communities.(c)Induced subgraph of sampled node set on the original graph is output as the ultimate sample graph.The new sampling method can be directly embedded into most of the existing graph sampling algorithms without modifying the core parts of the original algorithm.Experiment results show that the new sampling method can effectively improve the underperformance of existing algorithms on community-imbalance graphs. |