
Research Of Public Opinion Information Mining On Bulletin Board Systems Based On Cluster Analysis

Posted on: 2011-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: D L Xu
GTID: 2178330338980957
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rapid development of the Internet and the growing popularity of the BBS (Bulletin Board System), the BBS has become an important platform for expressing public sentiment, giving most Internet users a space to exchange opinions freely. However, it also carries foul language, abuse, and other uncivilized behavior, and even some anti-government or socially disruptive expression. BBS public opinion monitoring technology emerged to guide public opinion correctly and keep the network environment clean. It is an effective management tool that helps the government track the hot topics of public concern in each period. It also helps the government understand the public's views and attitudes on these topics, so that it can make sound, well-informed decisions.

The main content of this thesis is as follows.

First, commonly used clustering methods and the evaluation criteria for clustering algorithms are described. The thesis mainly studies the application of two typical partitioning clustering algorithms, K-means and K-medoids, to text mining. Experiments evaluate both algorithms on accuracy and recall against a manually judged standard. The results show that K-medoids scores 5 percent higher than K-means on both accuracy and recall, and that it is more robust to outliers and noisy data. The K-medoids algorithm is then improved. Because it repeatedly recalculates sums of distances, K-medoids is costly in time and memory. To address this, the similarity between objects is calculated before clustering and stored in a similarity matrix.
Looking up the similarity matrix greatly reduces the time and memory cost of calculating distances within a cluster.

Second, the thesis describes how unstructured BBS documents are converted to structured text, and the BBS text preprocessing steps: Chinese lexical analysis, stop word filtering, text feature representation, text feature selection, and weight calculation. These are the foundation of the system design in Chapter V.

Finally, based on the work of the preceding chapters, the BBS Hot Topic Mining System is implemented. BBS topics are converted to structured text by a crawler program followed by text preprocessing. The K-medoids algorithm is then used to identify topics. Finally, an evaluation function based on the minimum cost of clustering ranks the topics, and the top ten hot topics are selected.
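
As a rough illustration of the similarity-matrix idea described above, the sketch below computes all pairwise distances once and then runs a simple alternating K-medoids loop that only reads from that matrix. It is not the implementation from the thesis. Cosine distance, the alternating update (rather than PAM-style swaps), the caller-supplied initial medoids, and all function names are assumptions made for the example.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def distance_matrix(vectors):
    """Compute every pairwise distance once, before clustering starts."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = cosine_distance(vectors[i], vectors[j])
    return dist


def assign(dist, medoids):
    """Group object indices by nearest medoid, using matrix lookups only."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        # A medoid always stays in its own cluster, so no cluster is empty.
        nearest = i if i in clusters else min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids are stable.

    initial_medoids is a list of distinct object indices (an assumption of
    this sketch; the source does not say how medoids are initialised).
    """
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid of a cluster: the member with the smallest summed
        # distance to all other members, read straight from the matrix.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return assign(dist, medoids)
```

Calling `distance_matrix(vectors)` once and passing the result to `k_medoids(dist, [0, 1])` returns a dictionary that maps each final medoid index to the indices of its cluster members. The update step never calls `cosine_distance` again, which is where the saving over recomputing distances in every iteration comes from.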
Keywords/Search Tags: data mining, Bulletin Board System, text clustering, K-medoids