Design And Implementation Of Weibo Data Mining System Based On Hadoop Platform

Posted on: 2019-11-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Liu
GTID: 2428330551960312
Subject: Computer technology
Abstract/Summary:
Big data, at the current stage of informatization, has been widely applied in government public services, finance, court decisions, medicine, gaming, tourism, and other popular industries. As a software framework for distributed processing of big data, the Hadoop platform helps users build and develop distributed programs easily, enabling high-speed computation and storage of big data. Weibo generates billions of text records each day, making it one of the most valuable sources of big data. This thesis therefore takes Weibo data as its research object, designs and implements an automated Weibo data mining system based on the Hadoop platform, and analyzes large volumes of Weibo data to extract the valuable information hidden in it. The main research contents of the thesis are as follows:

(1) Data collection. The thesis develops a distributed concurrent framework based on the producer-consumer model. It also designs and implements a Python-based Weibo data collection system, deployed on Linux, that collects in real time the latest original posts from a large number of high-quality Weibo users. After preprocessing, the data is saved to local disk at regular intervals as text files.

(2) Data storage. For the batch data saved locally, the system uses the distributed file system HDFS and the data warehouse Hive. The thesis designs the partitioned data layout, writes scheduled Linux user tasks that create data partitions automatically, and writes scripts that automatically upload the locally saved Weibo data to the Hadoop cluster and import it into the corresponding partition in Hive. Data collected in real time is placed in a Kafka message queue.

(3) Data analysis. The heat of a Weibo post is measured by the number of comments, reposts, and likes it receives. To find hot posts, the Weibo data in HDFS is batch processed with Hive-SQL, and the Weibo data in the Kafka message queue is processed in real time with Spark Streaming. Using the LDA topic model algorithm, the system identifies the hot topics people are currently discussing.

(4) Result presentation. With the help of the Kdp-report system, the thesis designs the report structure and extracts the analysis results from the Hive data warehouse. The results are presented visually on web pages, so that users can browse hot Weibo posts and hot topics in a timely manner.
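
The heat measure described in item (3) can be illustrated with a minimal Python sketch. The abstract only states that heat depends on the counts of comments, reposts, and likes; the weighted-sum form, the default weights of 1.0, and the names `Post`, `heat`, and `hottest` are assumptions made for illustration, not the thesis's actual formula or code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Post:
    """One Weibo post with its interaction counts (field names are hypothetical)."""
    post_id: str
    comments: int
    reposts: int
    likes: int


def heat(post: Post, w_comment: float = 1.0, w_repost: float = 1.0,
         w_like: float = 1.0) -> float:
    """Assumed heat score: a weighted sum of the three interaction counts.

    The thesis abstract does not give the weights; equal weights are a placeholder.
    """
    return (w_comment * post.comments
            + w_repost * post.reposts
            + w_like * post.likes)


def hottest(posts: Iterable[Post], n: int = 10) -> List[Post]:
    """Return the n posts with the highest heat, hottest first."""
    return sorted(posts, key=heat, reverse=True)[:n]
```

In the system described above, the same ranking would run as a Hive-SQL query over the batch data or as a Spark Streaming job over the Kafka queue; the in-memory version here only shows the scoring and ordering logic.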
Keywords/Search Tags: Weibo, Hadoop, Hive, Spark, LDA