Font Size: a A A

Research And Application Of Parallelization Of Community Discovery Algorithm Based On Spark

Posted on:2021-05-18Degree:MasterType:Thesis
Country:ChinaCandidate:Z Z LiuFull Text:PDF
GTID:2370330620961347Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The family market is a key competitive market in the communication industry.The operators urgently need a family relationship recognition model,which can accurately identify the family in the massive user history call records.With the rapid popularization of smartphones,the calling social network not only becomes the largest social network,but also maps the intimate relationship between different users in the real world,so the calling social network presents a certain community structure.Because of this feature,this thesis proposes to use the community discovery algorithm to construct the family relationship recognition model on the calling social network.Considering factors such as time and module degree,this thesis chooses the Louvain algorithm as the community discovery algorithm of the family relationship recognition model.At present,the scale of social networks in the real world has already reached 100 million,which brings severe computing challenges to the establishment of the family relationship recognition model.Due to the call-data-set shows the characteristics of mesh type figure structure,and the Spark distributed parallel computing platform provides the GraphX components for diagram analysis and calculation,so this article on the Spark platform build relationships recognition model and based on the GraphX Louvain algorithm parallelization,main work and innovation points include the following parts:1.Realize the parallelization of the Louvain algorithm based on GraphX.This thesis analyzes the basic principle of the Louvain algorithm,completes the core calculation steps of the Louvain algorithm through the sending and aggregation message mechanism of GraphX,and realizes the parallelization of the Louvain algorithm on GraphX.To solve the problem of message lag after parallelization,the PLL(Process Lock Louvain)algorithm based on Process Lock idea and CGL(Connected Graph Louvain)algorithm based on Connected Graph idea are implemented in this thesis.Through the experimental analysis,the CGL algorithm based on the idea of the Connected Graph is finally selected by comprehensively considering the running time and result accuracy of large-scale network community discovery.2.Improve the CGL algorithm.In the process of dividing the community by the community discovery algorithm,each community has a unique community tag representing the current community,each member has a unique member tag representing the current node,and the community tag is selected from the member tag within the community.However,in the application of parallelization,it is found that some community tags do not come from community member tags,resulting in some nodes being misallocated to the same community.In this thesis,the CGL algorithm is improved by adding the step of community tag redistribution,and ICGL(Improve Connected Graph Louvain)algorithm is proposed.Through the analysis of the experimental results,it is found that the ICGL algorithm is more modular than the CGL algorithm when the time difference between the ICGL algorithm and the CGL algorithm is not significant.Besides,the ICGL algorithm is derived from the Louvain algorithm,which is a bottom-up clustering algorithm for aggregation levels.In the process of aggregation,the result resolution found by the community is insufficient.Therefore,this thesis proposes the definition of a super-large community,the definition of community decomposition,and design and implementation of a multi-super-large community parallel decomposition algorithm based on the ICGL algorithm.The experimental results show that the parallel decomposition algorithm can quickly obtain community discovery results of different resolutions,and the algorithm can improve the accuracy of community segmentation results to some extent.3.Building a family relationship recognition model based on the Spark distributed parallel computing platform.Based on the call data onto a city,the ICGL algorithm and community parallel decomposition algorithm is used.This thesis proposes to build a family relationship recognition model based on a family intimacy network and a family relationship recognition model based on call times network according to characteristics such as call time period,call times,and user attributes.Comparing the two kinds of family relationships recognition model precision rate and recall rate,operation time and degree of the module,the experimental results show that the proposed family relationships based on intimacy family network identification model has more accurate recognition ability,and family relations further proves that the proposed ICGL community discovery algorithm and parallel decomposition algorithm has important theoretical significance and practical application value.
Keywords/Search Tags:Family relationship recognition model, Calling social network, Community discovery, Parallelization, Connected Graph
PDF Full Text Request
Related items