Analysis Of Micro-blog User Influences Based On Hadoop

Posted on: 2019-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2428330542996910
Subject: Computer science and technology
Abstract/Summary:
As the number of Internet users grows and social media becomes more influential, micro-blogging has attracted increasing public participation. The arrival of industry celebrities and popular stars in particular has strengthened both the intensity and the breadth of micro-blog's social influence. With Internet traffic surging, using big data techniques to acquire and analyze data has become a research hot spot. Sina micro-blog holds a large amount of raw data; analyzing that data and uncovering its potential value is a foundation for understanding users better, and it offers strong technical support for enterprises' precision marketing and commercial promotion.

This project collects micro-blog data sets and classifies users by their influence, which provides a theoretical basis for personalized services. The project consists of three modules: data crawling, data import, and data analysis. The data crawling module designs the crawling architecture and collects user data and micro-blog data. The data import module studies how to load data from sources with different formats and structures into a Hadoop cluster, with the aim of improving the robustness and efficiency of data acquisition. The data analysis module first extracts user features, then designs a user influence model, and finally clusters users with the K-MEANS algorithm along three dimensions: CMI (the charm index of micro-blog content), a user behavior factor, and a transfer depth factor.

The collector is built on the Scrapy framework in Python and includes a proxy IP module to work around the anti-crawling limits of target websites. The data import module handles heterogeneous data sources: it uses Sqoop, shell scripts, and Apache Flume to load data from relational databases, non-relational databases, and file systems into HDFS or Hive. The data analysis module extracts user and micro-blog features, including the numbers of fans, followers, likes, comments, and forwards, as well as the forwarding depth. It then determines the number of clusters K by using the computeCost interface and obtains the clustering results with the K-MEANS implementation in Spark MLlib.

The main contribution of this thesis is a distributed data crawling and analysis platform, on top of which data access procedures load data into Hadoop clusters, and with which the influence of Sina micro-blog users is finally analyzed. The thesis provides methods for data crawling and data access, and offers a relevant theoretical basis for business marketing.
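The way a cost function can guide the choice of K may be sketched as follows. This is a minimal pure-Python stand-in, not the thesis's Spark MLlib code: the function names (`kmeans`, `compute_cost`), the random seed, and the nine three-dimensional user vectors (CMI, behavior factor, transfer depth factor) are all hypothetical and chosen only for illustration. The cost is the sum of squared distances from each point to its nearest center, which is the quantity the abstract's computeCost step is assumed to report.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; initial centers are k sampled data points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def compute_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(squared_distance(p, centers[nearest(p, centers)])
               for p in points)


# Hypothetical feature vectors: (CMI, behavior factor, transfer depth factor).
users = [
    (0.90, 0.80, 0.90), (0.85, 0.90, 0.80), (0.95, 0.85, 0.85),
    (0.50, 0.50, 0.40), (0.45, 0.55, 0.50), (0.55, 0.45, 0.45),
    (0.10, 0.10, 0.05), (0.05, 0.15, 0.10), (0.15, 0.05, 0.10),
]

# Cost for each candidate K; the "elbow" where the drop flattens suggests K.
costs = {k: compute_cost(users, kmeans(users, k)) for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 4))
```

Reading the printed costs from K = 1 upward, one would look for the value of K after which adding more clusters reduces the cost only marginally. On a real cluster the same loop would train one model per candidate K and compare the reported costs.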
Keywords/Search Tags:Hadoop, Distribution, Data Crawling, Data Access, User Clustering