Research Of Clustering Approaches On Distributed Online Public Opinion Based On Mapreduce

Posted on: 2016-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Zhao
GTID: 2308330467472620
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the Internet, the network has become, as a new medium, the main channel through which the public obtains and publishes information. Because the network is convenient, free, virtual, open and pervasive, social events often evolve into hot public opinion issues, which may harm public security and stability. It is therefore essential to cluster network public opinion quickly, identify hot issues in time, and then monitor and guide them effectively, which is of great significance for maintaining social stability and guaranteeing information security.

Meanwhile, public opinion information on the Internet is massive and shows rapid, sustained growth. Traditional clustering methods are difficult to apply to large-scale network data in a centralized way because of their high time and space complexity, low efficiency and insufficient memory. To solve this problem, this thesis studies parallel clustering algorithms in depth and proposes a novel approach based on MapReduce. Experiments show that parallelization effectively solves the problem described above. Since most network public opinion information exists in text form, this thesis focuses on parallel clustering of network public opinion texts. The main contributions are summarized as follows:

First, the Birch algorithm is improved. This thesis selects the Birch algorithm as the target method for public opinion text clustering. By analyzing the deficiencies of the Birch algorithm, we put forward an improved Birch algorithm.
The improvements cover three aspects: (1) We put forward a reasonable method to detect and remove outliers, and present a mechanism for setting the dynamic parameters. (2) We optimize the construction of a new CF-tree based on rebuilding theory and propose a continuous optimization scheme in place of the triggered, discrete optimization scheme of the Birch algorithm. (3) We conduct a series of experiments comparing the improved Birch algorithm with the traditional Birch algorithm; the simulation results show that, when an appropriate noise detection function parameter and mini-cluster expansion coefficient are selected, the improved Birch algorithm outperforms the traditional one in both clustering effect and runtime efficiency.

Second, using the distributed parallel computing framework MapReduce from the Hadoop project, we design and implement in Java a parallel version of each stage of text processing and text clustering. We evaluate the performance of parallel Birch clustering of public opinion texts on the text classification corpus provided by Sohu (a Chinese web portal) in terms of speedup, efficiency and scalability. The experiments indirectly show that the parallel text clustering algorithm is superior to the traditional serial text clustering algorithm in clustering performance and efficiency, and that it greatly reduces data processing time.

In conclusion, this thesis takes network public opinion texts as its research object and focuses on how to implement parallel clustering of network public opinion texts based on the MapReduce programming model.
The main work of this thesis is to study clustering algorithms and combine the improved Birch clustering algorithm with MapReduce, thereby realizing parallel clustering of network public opinion texts. It concludes that parallel public opinion text clustering is superior to traditional public opinion text clustering in both clustering effect and efficiency, and that it is better suited to clustering large-scale network public opinion information. Moreover, classifying network public opinion information quickly and efficiently in different situations lays a solid theoretical basis for further research on trend forecasting, topic discovery, hot topic tracking, monitoring and other network public opinion topics.
Keywords/Search Tags: network public opinion, TF-IDF, Birch algorithm, MapReduce, parallel text clustering
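
The abstract names the Birch algorithm and its CF-tree but does not show how a cluster is summarized. As general background, standard Birch represents each cluster by a clustering feature (CF): the point count N, the per-dimension linear sum LS, and the sum of squared norms SS. Two CFs merge by simple addition, which is one reason such summaries can be combined across data partitions. The Java sketch below illustrates only this textbook definition. The class and method names are hypothetical, and it is not the thesis's improved algorithm or its MapReduce implementation.

```java
/**
 * Minimal sketch of a standard Birch clustering feature (CF): the triple (N, LS, SS).
 * Hypothetical illustration only; names are assumptions, not taken from the thesis.
 */
final class ClusteringFeature {
    private final long n;       // N: number of points summarized
    private final double[] ls;  // LS: per-dimension linear sum of the points
    private final double ss;    // SS: sum of squared Euclidean norms of the points

    private ClusteringFeature(long n, double[] ls, double ss) {
        this.n = n;
        this.ls = ls;
        this.ss = ss;
    }

    /** CF of a single point, for example one TF-IDF document vector. */
    static ClusteringFeature ofPoint(double[] x) {
        double ss = 0.0;
        for (double v : x) {
            ss += v * v;
        }
        return new ClusteringFeature(1, x.clone(), ss);
    }

    /** CF additivity: the CF of two disjoint clusters is the component-wise sum. */
    ClusteringFeature merge(ClusteringFeature other) {
        if (ls.length != other.ls.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double[] sum = new double[ls.length];
        for (int i = 0; i < ls.length; i++) {
            sum[i] = ls[i] + other.ls[i];
        }
        return new ClusteringFeature(n + other.n, sum, ss + other.ss);
    }

    long count() {
        return n;
    }

    /** Centroid: LS / N. */
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int i = 0; i < ls.length; i++) {
            c[i] = ls[i] / n;
        }
        return c;
    }

    /** Radius: root mean squared distance of the points to the centroid. */
    double radius() {
        double centroidNormSq = 0.0;
        for (double v : ls) {
            centroidNormSq += (v / n) * (v / n);
        }
        // SS/N - ||centroid||^2; clamped at 0 to absorb floating-point rounding.
        return Math.sqrt(Math.max(0.0, ss / n - centroidNormSq));
    }
}
```

Calling `ClusteringFeature.ofPoint(a).merge(ClusteringFeature.ofPoint(b))` yields the summary of the two-point cluster without revisiting the raw vectors. In Birch, a CF-tree node decides whether to absorb a point by comparing such a merged radius (or diameter) against a threshold.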