
Hadoop Scheduler Optimization And Its Application In Public Opinion Analysis

Posted on: 2016-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: X Tang
Full Text: PDF
GTID: 2308330473962427
Subject: Computer technology
Abstract/Summary:
Media such as news sites, microblogs and blogs constantly generate massive amounts of data with sentiment orientation, and public opinion analysis based on data mining is needed to help governments and enterprises monitor and respond to people's opinions. The key problem is the efficiency of public opinion analysis over massive data. Hadoop has become a popular framework for processing large datasets in parallel over a cluster because of its high scalability, reliability and low cost. However, its performance may be degraded by excessive network traffic caused by two problems: poor data locality in Reduce task scheduling, and partitioning skew. Therefore, we optimize Hadoop's task scheduler, and then design and implement a public opinion analysis system on top of it. We also propose an improved algorithm to address problems that arise during public opinion analysis. The major contributions of this thesis are as follows:

(1) The Minimum Transmission Cost Reduce task Scheduler (MTCRS) is proposed, with the Average Reservoir Sampling (ARS) algorithm used for data sampling. An intermediate data transmission cost model is then built to calculate the best launch location for each Reduce task; its parameters are derived from information obtained from the sampled data processed by Map tasks. Extensive experimental results show that MTCRS reduces network traffic by 8.4% compared with the Fair scheduler.

(2) A hybrid public opinion analysis approach based on mutual information and improved KMeans clustering is proposed. First, a stop-word dictionary and Part-of-Speech tagging are used to reduce the feature dimension. Then the density peak algorithm is combined with binary search to determine the cluster number K and the initial centers for KMeans clustering. Finally, hot words for each cluster are extracted with mutual information, followed by sentiment analysis and trend analysis. Extensive experiments are conducted on Hadoop, and the precision, recall and F1-measure of clustering reach 87.52%, 81.54% and 84.42% respectively. This hybrid approach can effectively mine hidden useful knowledge to support decision-making.

(3) The public opinion analysis system is deployed on the optimized Hadoop platform. Each module is designed and implemented as a MapReduce job. There are three major modules: a data collection module, a topic detection module and a sentiment analysis module.
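
The idea behind contribution (1) can be illustrated with a minimal sketch. It is not the MTCRS implementation, and the abstract does not specify the cost model, so the sketch makes a simplifying assumption: the cost of launching a Reduce task on a node is the number of intermediate bytes for its partition that reside on other nodes, with all network links treated as equally expensive. The function names, node names and byte counts are hypothetical; in the thesis the sizes would be estimated from sampled Map output.

```python
from typing import Dict, Iterable


def transmission_cost(partition_bytes: Dict[str, int], candidate: str) -> int:
    """Bytes that must cross the network if the Reduce task runs on `candidate`.

    Assumption (not from the thesis): data already on the candidate node is
    free to read, and every remote byte costs the same.
    """
    return sum(size for node, size in partition_bytes.items() if node != candidate)


def best_reduce_node(partition_bytes: Dict[str, int], free_nodes: Iterable[str]) -> str:
    """Pick the free node with the lowest transmission cost for this partition.

    Nodes are sorted first so that ties are broken deterministically by name.
    """
    return min(sorted(free_nodes), key=lambda n: transmission_cost(partition_bytes, n))


if __name__ == "__main__":
    # Hypothetical estimated intermediate data (in MB) for one Reduce partition.
    estimated = {"node-a": 120, "node-b": 700, "node-c": 180}

    # All nodes have a free Reduce slot: the node holding most of the data wins.
    print(best_reduce_node(estimated, ["node-a", "node-b", "node-c"]))

    # node-b is busy: fall back to the cheapest of the remaining nodes.
    print(best_reduce_node(estimated, ["node-a", "node-c"]))
```

Under this assumption the choice reduces to placing each Reduce task where the largest share of its input already sits. A fuller model might weight remote bytes by rack distance or link bandwidth.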
Keywords/Search Tags: Hadoop, Public Opinion, Task scheduler, Clustering, Sentiment analysis