Font Size: a A A

Research And Implementation Of Community Detection Algorithm Oriented Web-Scale Graph Data On Hadoop Platform

Posted on:2016-02-03Degree:MasterType:Thesis
Country:ChinaCandidate:Y C WangFull Text:PDF
GTID:2428330542989576Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Data sets originating from many different real world domains can be represented in the form of interaction networks in a very natural,concise,and meaningful fashion.Why is it that?This is particularly true in the social context,especially given recent advances in Internet technologies and Web 2.0 applications leading to a diverse range of evolving social networks.Analysis of such networks can result in the discovery of important patterns and potentially shed light on important properties governing the growth of such networks.Social networks applications have stored plenty of relations network data(also called graph data).These data is very important and valuable for community detection.Community detection and analysis is an important methodology which is partitioning a large scale social network into many smaller partitions(also called community)for analyzing a complex and massive network to understand the relationship of a real-world network.This methodology has been applied in a lot of fields.The following is some good examples.For example,this methodology can be used to recommend some themes like food,clothes,books and so on to you while you are scanning social networks.Another applied area of community detection is the field of information biology.We can use this technology to analyze a gene regulatory network.We also can use this methodology to recommend some news and goods for you according your browsing habits.Meanwhile we can use this technology to analyze an epidemic transmission network to predict the spread of the epidemic and find out the critical epidemic community as well as the crucial transmission source nodes in the epidemic transmission network.Especially in recent years plenty of data have been stored in a lot of servers in all fields with the strong rise of social networks applications.The scale of these data which have been stored in a lot of servers is very large.We can not use these data directly.The only way which we can use to analyze these data is community detection methodology.These data can be used to serve economy development and society improvement by community detection technology.As a consequence,there are important academic meaning and applying value in the research on the theme of community detection.In recent years,many in-memory community detection algorithms have been proposed for graphs with millions of edges.Due to this kind of algorithms is only for graphs with millions of edges,the common point of this kind of algorithms is that they are all designed for in-memory and single node to analyze large scale graph data.However,with the coming of the era of big data and the growth of graph data,much and much graph data have been accumulated and stored so that emerging massive graph data with billion edges.For this large scale graph data,using the above mentioned algorithms which are designed for running in memory and a single node to analyze very large scale graph data is impossible to reach your goal.Fortunately,with the rise of distributed computing technology,distributed computing technology is very mature until today and emerging the representative open source distributed computing platform Hadoop.A mature distributed computing platform like Hadoop makes it possible for us to process and analyze massive data in the era of big data.Our approach is using the distributed algorithm on the Hadoop to process and analyze the very large graph data with billions edges.In this thesis,we show how to find community partitions of networks with billions of edges.Our approach is based on an ensemble learning scheme for community detection that provides a way to identify high quality partitions from an ensemble of partitions with lower quality-We present a pre-processing procedure for community detection algorithms that significantly decreases the problem size.After reducing the problem size,traditional non-distributed community detection algorithms can be applied.We implemented a weak but highly scalable label propagation algorithm on top of the distributed-computing framework Apache Hadoop.The evaluation of our implementation on Hadoop cluster and with evaluation datasets up to 3.3 billion edges shows very good results with respect to clustering quality as well as scalability.For a smaller 260 million edges network,we show that our pre-processing can improve the results of the popular Louvain modularity clustering algorithm.In the procedure of research on community detection,many scientists have proposed plenty of excellent and classic algorithms.In this paper,we also choose a classic community detection algorithm:GN algorithm to analyze and implement.
Keywords/Search Tags:community detection, distributed algorithm, Web-scale graph data, MapReduce, GN algorithm
PDF Full Text Request
Related items