Microblogging is a social platform for sharing and exchanging information. Since the Chinese company Sina launched its microblog platform in 2009, the platform has developed rapidly and been widely adopted. As of September 30, 2016, Sina microblog had reached 297 million monthly active users. Microblog information is characterized by simple and fast interaction, dissemination at any time and in any place, a low threshold for publishing, and fission-style propagation. Microblogs serve as a platform both for releasing news and for interactive information exchange, and they play an increasingly important role in everyday online behavior such as obtaining, transmitting, and retrieving information. By contrast, the analysis and mining of microblog information is still at an early stage. Because microblog texts are short, massive in volume, and non-standard in form, traditional analysis methods struggle to meet the demand.

To this end, this thesis introduces text clustering and, in view of the characteristics of microblog comments and the large number of user comments on hot events, explores a text-clustering-based process for handling microblog comment information. The purpose is to group comments with similar content in order to understand public views on hot events, to support effective public opinion analysis and monitoring for specific events, and to help decision-makers better understand public opinion and thereby contribute to decision-making and reform.

The major work presented in the thesis is as follows. First, the thesis analyzes the characteristics of microblog text, reviews commonly used methods of text analysis, and describes cluster analysis, including its definition, forms, and similarity measures. Second, according to the characteristics of microblog information and the way it is processed, the thesis analyzes the steps for clustering microblog comments: text preprocessing, text representation, and cluster analysis. For preprocessing, it discusses Chinese word segmentation, stop word filtering, and text denoising. For text representation, it discusses various representation methods and feature weights. For clustering, it analyzes different approaches and describes a variety of algorithms. On the basis of this discussion, the thesis determines the specific methods it uses. R is used for text denoising, and the jiebaR package is used for Chinese word segmentation, stop word filtering, and other preprocessing. After comparing several text representation methods, the thesis represents microblog comments with the vector space model. For clustering, the k-means algorithm is widely used, but it is sensitive to initial points and outliers and requires the value of K to be set manually, so the k-medoids algorithm is also adopted. The k-medoids algorithm is similar to k-means but is robust to outliers, and the pamk function in R does not require K to be set manually. During implementation, the influence of K and of the initial points on the clustering results is analyzed, and the way to implement the k-medoids and k-means algorithms in R is discussed. Word clouds and term networks are used to visualize the microblog comment information.

The experiments show that the choice of random seed has little effect on the clustering results and that, because the amount of data is not large, there is no significant difference in the running time of the algorithms. When hierarchical (system) clustering is used to cluster the feature terms, its results are better than those obtained with the sum-of-squared-deviations method and the maximum distance method. The k-medoids cluster analysis yields 2 clusters, but the average silhouette value is about 0.69. Because the thesis relies on dictionary-based word segmentation and the vector space model, the semantic relations between feature terms are only weakly captured, so the clustering results are not entirely reasonable.
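
As a rough illustration of how the number of clusters can be chosen without fixing K by hand, the following Python sketch runs a simplified k-medoids procedure for several values of K and keeps the one with the highest average silhouette width, in the spirit of what the pamk function does in R. It is not the thesis's implementation: the thesis works in R, while the function names (`pam_sketch`, `average_silhouette`, `choose_k`), the toy 2-D points standing in for comment vectors, the Euclidean distance, and the first-improvement swap rule are all assumptions made for illustration.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(p, q) for q in points] for p in points]


def total_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam_sketch(dist, k):
    """Simplified PAM-style k-medoids: greedy build, then first-improvement swaps."""
    n = len(dist)
    medoids = []
    # BUILD: repeatedly add the point that lowers the total cost the most.
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(dist, medoids + [c])))
    # SWAP: exchange a medoid with a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                if total_cost(dist, trial) < total_cost(dist, medoids):
                    medoids, improved = trial, True
    # Label each point with the position of its nearest medoid.
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels


def average_silhouette(dist, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != c)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def choose_k(dist, k_values):
    """Return (k, score, labels) for the k with the highest average silhouette."""
    results = []
    for k in k_values:
        _, labels = pam_sketch(dist, k)
        results.append((k, average_silhouette(dist, labels), labels))
    return max(results, key=lambda r: r[1])


# Toy data (made up): two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
dist = pairwise_distances(points)
best_k, best_score, best_labels = choose_k(dist, range(2, 5))
print(best_k, round(best_score, 2), best_labels)
```

On real comment data the rows of `points` would be the term-weight vectors of the vector space model, and a cosine-based distance may be a more natural choice than the Euclidean distance used here.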