Font Size: a A A

Research On Performance Optimization Of Community Detection In Cloud Computing Environments

Posted on:2019-04-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:F DingFull Text:PDF
GTID:1368330590466661Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The ongoing innovation of social networking service models enhance the functionality of social network applications.There are various applications of social networks in daily life such as instant messaging,news feeds,payment transactions and entertainments.The research results of community detection in social networks have been widely used for precise marketing,search engine development,online public opinion analysis,and so forth.However,it is rising sharply the number of users in social networking sites,as well as the scale of social networks.The social networks produce big data with large sample size,complex relationships,and high dimensions.It makes the performance of data analysis in large-scale social networks the key factor that determines whether the community detection results are valuable.Diverse data types in social networks make it more difficult to perform community detection upon heterogeneous data sources.The performance requirements for community detection in largescale social networks based on cloud computing are becoming more urgent.Huge number of users make it a performance bottleneck to perform statistical reasoning in distributed community detection.The frequently building of contingency tables caused by high-dimensional data seriously affects the efficiency of distributed community detection.Focusing on the problems in social big data analysis from the perspectives of variety,velocity,volume,and high dimensions,this thesis investigates how to utilise diverse social network data to improve the accuracy of community detection,how to utilise cloud computing with higher parallelism to detect communities,and how to improve the performance of distributed community detection with large-scale and high-dimensional data.The thesis is organized as follows:(1)We propose a social network community detection method based on probabilistic graph.We propose the probabilistic-graph-based community detection(PGCD)model,where the potential associations between user features' similarities are modelled by the feature similarity probabilistic graph(FSPG).Besides,the social circle set is introduced to describe which users and features are associated with certain social circles.The relation probabilities set describes the probability of forming an ego network given a group of users and social circles.We also present the FSPG learning algorithm and the ego network probabilistic information criterion as the objective function.We used real-life datasets from Facebook,Twitter and Google+ to conduct experiments.The experimental results indicate that the PGCD model trained by the FSPG learning algorithm could efficiently and accurately predict social circles.(2)We present a distributed community detection method based on Storm Topology.The Topology-based distributed community detection(TDCD)algorithm makes the scoring and expanding of a directed acyclic graph as the basic computing unit,so that the centralized community detection process is distributed at the lowest granularity to increase the parallelism of FSPG learning algorithm.An identifier automatic generation mechanism is designed to ensure that each probabilistic graph corresponds to a distinct computing node in the cluster,so that loops in the search space can be detected and handled.By using the hash value of each probabilistic graph as the key of a tuple,the calculation tasks can be assigned more efficiently to the computing nodes.We also analyse the parallelism of the algorithm.We configured a Storm cluster and conducted comprehensive performance testing experiments in the distributed environment.The results suggest that the algorithm could achieve high parallelism.Moreover,the performance improvement is more significant in a larger dataset.(3)We propose a performance optimization method based on sparse ADtree for distributively detecting communities.We present the sparse-ADtree-based sufficient statistic extraction(ADSSE)algorithm,where the sufficient statistics are stored in the sparse ADtrees and sent to each computing nodes.Hence,the frequent access to the distributed file system is avoided.We also propose the sparse-ADtreebased probabilistic graph scoring(ADPGS)algorithm.The ADPGS algorithm restores contingency tables from the sparse ADtrees to calculate the scores of probabilistic graphs.The duplicate calculations of contingency tables are reduced by maintaining local contingency table sets.We conducted performance testing experiments in distributed environment using ground-truth datasets.The experimental results show that the searching and scoring time for detecting communities is greatly reduced.In summary,the distributed process for learning model parameters is significantly accelerated.(4)We present a performance optimization method for distributively detecting communities upon high-dimensional data.We propose two effective data structures of contingency tables: list contingency table and hash contingency table.The contingency table structures are simplified by using onedimensional arrays and hash tables for storing conditional probabilities.Thus,the time of building and indexing contingency tables from high-dimensional data is reduced.The fast contingency table constructing(FCTC)algorithm is proposed to avoid the recursive calls in the building process.We solve the problem of recovering the MCV nodes from sparse ADtrees in a non-recursive approach so that the time for building contingency tables is substantially shortened.The comprehensive performance comparison experiments were conducted using a large number of random datasets as well as real-life datasets.The experimental results unfold that the time required to build the list contingency tables and hash contingency tables from high-dimensional data is shorter than the existing approaches.The FCTC algorithm further accelerates the process for calculating the scores of the probabilistic graphs.
Keywords/Search Tags:Community detection, Cloud computing, Probabilistic graphical model, Storm Topology, Sparse ADtree, Contingency table
PDF Full Text Request
Related items