Research On Spark Based Public Opinion Analysis Architecture

Posted on: 2017-02-01 | Degree: Master | Type: Thesis
Country: China | Candidate: Z L Tan
GTID: 2308330485469655 | Subject: Computer technology
Abstract/Summary:
Public opinion analysis produces brief reports, charts and other results by automatically capturing, classifying and clustering the massive volume of information on the Internet. These results give analytical support to decision makers, who can then keep track of the public's ideological trends and guide public opinion appropriately. With the spread of the mobile Internet, e-commerce, social networking and other emerging technologies, the number of Internet users has grown explosively. An efficient processing architecture that can handle massive data is therefore of great significance to public opinion analysis.
Using micro-blog data from the Sina website, this thesis applies big data processing technology and examines the feasibility of building a public opinion analysis framework based on Spark. The main contributions of this thesis are as follows.
First, the overall architecture is designed and a distributed Hadoop platform is built for data storage and processing. On top of this mass data storage, HBase and Lucene are combined to improve data retrieval and read/write performance.
Second, a highly efficient and stable data acquisition scheme is designed that overcomes the defects of the existing simulated-login technique and API-based acquisition programs. Redis is used to manage the waiting queue, the updating queue and the crawled collection, so that repeated collection is avoided and data is updated in a timely manner. A dynamic IP proxy (agent) pool mechanism is presented to solve the problem of IP restrictions. The proxies in the pool are constantly updated so that they remain as effective as possible for different web pages. This ensures continuous and stable data capture and improves collection efficiency.
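The queue handling in the acquisition scheme can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract only names the three structures, so the class name `CrawlFrontier`, its methods, and the policy of serving new URLs before refreshes are all assumptions made for illustration. Plain Python containers stand in for Redis here so the example runs on its own; in a Redis-backed version the two queues would plausibly be lists (`LPUSH`/`RPOP`) and the crawled collection a set (`SADD`/`SISMEMBER`).

```python
from collections import deque


class CrawlFrontier:
    """Hypothetical in-memory stand-in for the three Redis-backed structures."""

    def __init__(self):
        self.waiting = deque()   # URLs discovered but not yet fetched
        self.updating = deque()  # already-fetched URLs due for a refresh
        self.crawled = set()     # every URL fetched at least once

    def add(self, url):
        """Queue a newly discovered URL unless it is already known."""
        # The deque membership test is a linear scan; fine for a sketch.
        if url in self.crawled or url in self.waiting:
            return False
        self.waiting.append(url)
        return True

    def next_url(self):
        """Serve new URLs first, then fall back to refreshing old ones."""
        if self.waiting:
            return self.waiting.popleft()
        if self.updating:
            return self.updating.popleft()
        return None

    def mark_crawled(self, url):
        """Record a completed fetch and schedule the URL for a later refresh."""
        self.crawled.add(url)
        self.updating.append(url)
```

Checking the crawled collection before queueing is what prevents repeated collection, while the separate updating queue lets already-seen pages be refetched without being mistaken for new ones.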
Third, to overcome the bottleneck of Hadoop in text clustering, the text clustering module uses an improved K-means algorithm based on Spark. To address the weaknesses of micro-blog features, the word2vec tool is used to extend the feature items in the preprocessing stage; the optimization of K-means focuses on the selection of the K value and on cluster initialization. The optimized K-means then processes the text data in parallel on the Spark framework, making the architecture more efficient at data processing and analysis.
Finally, a bottleneck detection method based on resource information gain is proposed for the platform. The method detects a bottleneck by monitoring the cluster's Response Satisfaction (RS). The specific bottleneck resource is then identified by calculating the information gain of each kind of resource from samples of resource usage rate and response satisfaction.
Public opinion analysis, as a powerful force in social construction, has great prospects in both application and research, so studying its architecture is imperative. Experimental results show that the framework proposed in this thesis adapts well to the massive data involved in public opinion analysis and achieves good results in data acquisition and data processing. It is feasible for handling large-scale public opinion analysis data.
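The information-gain step of the bottleneck detection method can be sketched as follows. This is a rough illustration under stated assumptions, not the thesis's method in detail: the abstract does not say how usage rates are discretized or how RS is measured, so the binary "satisfied" label, the 0.8 usage threshold, and the function names are all hypothetical. The sketch computes the standard information gain of each resource's high/low usage level with respect to response satisfaction and reports the resource with the largest gain.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(levels, satisfied):
    """Reduction in entropy of `satisfied` after splitting on `levels`."""
    total = len(satisfied)
    remainder = 0.0
    for level in set(levels):
        subset = [s for lv, s in zip(levels, satisfied) if lv == level]
        remainder += len(subset) / total * entropy(subset)
    return entropy(satisfied) - remainder


def likely_bottleneck(samples, satisfied, threshold=0.8):
    """Return the resource with the highest gain, plus all gains.

    `samples` is a list of dicts mapping resource name to usage rate (0..1);
    `threshold` is an assumed cut-off separating high from low usage.
    """
    gains = {}
    for resource in samples[0]:
        levels = [sample[resource] >= threshold for sample in samples]
        gains[resource] = information_gain(levels, satisfied)
    return max(gains, key=gains.get), gains
```

A small usage example with made-up samples, where unsatisfactory responses coincide with high CPU usage but not with memory usage:

```python
samples = [
    {"cpu": 0.95, "mem": 0.5},
    {"cpu": 0.90, "mem": 0.9},
    {"cpu": 0.30, "mem": 0.9},
    {"cpu": 0.20, "mem": 0.4},
]
satisfied = [False, False, True, True]
name, gains = likely_bottleneck(samples, satisfied)
print(name, gains)
```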
Keywords/Search Tags:Public Opinion Analysis, Data Acquisition, Spark, Text Clustering, Bottleneck Detection