
Social Network Data Analytic Platform And User Retweet Behavior Analysis

Posted on: 2016-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: K Deng
Full Text: PDF
GTID: 2348330488474022
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet, social networks generate large amounts of data every second. Taking Sina Weibo as an example, users produce more than one hundred million pieces of data every day. At this scale, the processing capacity of a single machine can no longer meet the demand for information processing, which is why big data technology emerged. In the big data setting, machine learning faces the same problem: models must be trained on very large sample sets, so machine learning algorithms also need to be parallelized at large scale.

Building on a study of the BDAS software architecture, this thesis proposes and implements a big data analysis platform based on Spark. With Spark at its core, the platform consists of data crawling, data processing, data mining, and data visualization. Furthermore, to verify the scalability of the platform and to mine more value from the data, this thesis studies the prediction of user retweet behavior on Weibo and implements the resulting algorithm on the platform.

The social network data analysis platform is divided into four parts. First, the data crawling module uses the distributed communication framework Akka to implement a distributed Sina Weibo crawler system, which supplies massive amounts of data. Second, the data preprocessing and storage module provides a distributed storage service based on the Hadoop Distributed File System, which handles storage, access, and fault tolerance for massive data. Third, the data mining and analysis module is implemented with MLlib, GraphX, and tools developed for this thesis. Relying on Spark's fast data processing capability, this module supports rapid processing, analysis, and mining. Fourth, the data visualization module uses a Tomcat server and a Redis cache to obtain the data and analysis results from the lower layers of the platform. With the D3.js visualization tool, the related results are displayed on web pages.

As an algorithm running on the social network big data analysis platform, a retweet behavior prediction algorithm is proposed in this thesis. The algorithm introduces a multi-task learning framework to avoid the homogeneity problem of traditional prediction models. After feature selection and extraction on the retweet behavior data, the proposed algorithm is implemented on the social network data analysis platform. The proposed algorithm is then compared with logistic regression (LR), support vector machine (SVM), and Passive-Aggressive (PA) algorithms.

This thesis presents the design and implementation of a big data analysis platform based on Spark and studies a user retweet behavior prediction algorithm built on that platform. In theory, the research on data analysis platform design and user retweet behavior prediction has reference value; in practice, the data analysis platform and the user behavior prediction algorithm are of exploratory significance.
Keywords/Search Tags:Big Data, Spark, Social Network, Distributed Web Crawler, Retweet Behavior, Multi-task Learning, Data Mining
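
The abstract names the Passive-Aggressive (PA) algorithm as one of the baselines for retweet prediction. As an illustration only, the following is a minimal sketch of the standard PA-I online update for binary classification, not the thesis's multi-task algorithm and not its Spark implementation. The feature layout (`follows_author`, `topic_match`, a constant bias term), the toy samples, and all function names are hypothetical and chosen purely for the example.

```python
def dot(w, x):
    """Inner product of two equal-length lists of floats."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_update(w, x, y, C=1.0):
    """One PA-I step. y is +1 (retweet) or -1 (no retweet)."""
    loss = max(0.0, 1.0 - y * dot(w, x))   # hinge loss on this sample
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return w                           # passive: margin already satisfied
    tau = min(C, loss / norm_sq)           # aggressive step, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


def train(samples, n_features, C=1.0, epochs=5):
    """Run several online passes over (features, label) pairs."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in samples:
            w = pa_update(w, x, y, C)
    return w


def predict(w, x):
    """Predict +1 (retweet) or -1 (no retweet)."""
    return 1 if dot(w, x) >= 0 else -1


# Hypothetical toy features: [follows_author, topic_match, bias]
samples = [
    ([1.0, 1.0, 1.0], 1),
    ([1.0, 0.0, 1.0], -1),
    ([0.0, 1.0, 1.0], 1),
    ([0.0, 0.0, 1.0], -1),
]
w = train(samples, n_features=3)
predictions = [predict(w, x) for x, _ in samples]
```

In this toy data the label depends only on `topic_match`, so the learned weight for that feature ends up dominating the others. A per-user or per-group variant of such a model is one place where a multi-task framework could plausibly enter, but how the thesis actually does this is not stated in the abstract.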