The Research And Design Of Public Sentiment Publishing Platform Based On Hadoop

Posted on: 2016-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Zhan
Full Text: PDF
GTID: 2308330464469020
Subject: Computer technology
Abstract/Summary:
The Internet carries a wide variety of information, and a web page is structurally more complex than a traditional text document. With today's highly transparent media, every major website can report events as they happen, and people can express their views and opinions through comment threads, WeChat, forums and other online media. Simple manual methods and traditional public opinion analysis tools cannot extract the valuable and latent information in web pages within limited time, effort and expense, and they cannot keep up with the pace of development.

To address the difficulty of quickly analyzing valuable information on the network, and in line with the actual needs of the Shandong Academy of Sciences, we propose a project that uses low-cost, highly scalable, high-performance Hadoop technology to process massive data quickly and to develop a public opinion dissemination system. Our research covers monitoring and processing thousands of items of information on the Internet, finding related news through mining analysis, removing duplicated texts, and finally presenting the results of public opinion analysis in block form. This helps the relevant departments understand the state of public opinion in the most recent period, gives warning of negative news or unexpected events, and generates public opinion reports that can be sent to the relevant personnel as a reference for management decisions. The system is built on Hadoop clusters, so it can use their distributed computing and storage capacity to process massive data efficiently. The innovation of this system lies in applying Hadoop technology to public opinion analysis and in proposing a new LCS-h algorithm for duplicate-text removal.
Compared with traditional duplicate-text removal algorithms, the LCS-h algorithm retains the word order of the original article, adds word frequency as a weight, and uses the Hamming distance to measure repeated words. Because the algorithm preserves the original order, it distinguishes duplicates with high accuracy, and it identifies repetition more accurately among articles that have similar lengths and share a large number of repeated words. After removing the duplicated documents related to the web page contents, we present the results in blocks and display the public opinion data in charts, so that public opinion monitoring personnel can understand the situation clearly. The significance of this research lies in using inexpensive machines to crawl the web efficiently and stably, analyzing public opinion information, and presenting the results in a visual web form that users can interact with. The system uses Hadoop to manage the distributed file system, applies the MapReduce programming model to develop parallel programs, uses traditional data mining algorithms for analysis, and presents the results of public opinion analysis visually, which verifies that the system can be implemented effectively.
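The abstract does not give the details of LCS-h, so the following is only a minimal sketch of the general idea it describes: an order-preserving longest common subsequence (LCS) over words, with word frequency used as a weight. The function names, the weighting scheme (combined count of a word in both documents) and the normalization are all assumptions made for illustration, not the thesis's method. The Hamming-distance step is left out because the abstract does not say what it is computed over.

```python
from collections import Counter


def weighted_lcs(a, b, weight):
    """Total weight of the heaviest common subsequence of word lists a and b."""
    prev = [0.0] * (len(b) + 1)
    for x in a:
        cur = [0.0]
        for j, y in enumerate(b, start=1):
            best = max(prev[j], cur[j - 1])
            if x == y:
                # Extending a match keeps the words in their original order.
                best = max(best, prev[j - 1] + weight[x])
            cur.append(best)
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Order-aware similarity in [0, 1]; the weighting is a hypothetical choice."""
    # Assumption: a word's weight is its combined count in both documents.
    weight = Counter(a) + Counter(b)
    total = sum(weight[w] for w in a) + sum(weight[w] for w in b)
    if total == 0:
        return 0.0
    return 2.0 * weighted_lcs(a, b, weight) / total


doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the mat".split()
shuffled = list(reversed(doc1))

print(similarity(doc1, doc2))      # near-duplicate: one word changed
print(similarity(doc1, shuffled))  # same words, different order
```

A bag-of-words measure would treat `doc1` and `shuffled` as identical, whereas this order-preserving score drops sharply, which is the property the abstract attributes to keeping the original word order. A pair of documents would be flagged as duplicates when the score exceeds some threshold; the abstract does not state one.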
Keywords/Search Tags: Public opinion analysis, Web data mining, Text deduplication, Hadoop