Microblog Topic Detection And Sentiment Analysis Based On Distributed Representation

Posted on: 2017-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Yang
Full Text: PDF
GTID: 2308330491454676
Subject: Computer software and theory
Abstract/Summary:
Social media plays an increasingly important role in people's daily lives. People publish a wide variety of information through social media and exchange opinions on social events. The spread of information among these users generates a large amount of text data, which has attracted many researchers to study how public opinion develops and which issues become hot topics. This paper studies Sina Microblog, which has grown rapidly in recent years, and obtains its text data through the Sina API. Traditional text representations suffer from high-dimensional sparse matrices and ignore semantics, grammar, and word order. To address these problems, this paper combines distributed representations with traditional methods to build a microblog topic detection method, and then proposes a novel and efficient method for microblog sentiment polarity classification.

This paper consists of two main parts: microblog topic detection and sentiment classification. The proposed topic detection method is based on the distributed representation of words, "Word2vec", combined with the traditional TF-IDF weighting method. Each microblog is converted to a text vector in this way, and the K-means clustering algorithm then performs the topic clustering. Using this method, the paper detects the topics discussed by users on Sina Microblog and demonstrates the feasibility and accuracy of the method experimentally.

After topic detection, the paper labels the microblog texts that are related to the detected topics and express a clear sentiment polarity, and introduces a method based on the distributed representation of documents, "Doc2vec", to convert the texts to fixed-length vectors.
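
The topic detection step described above (Word2vec vectors combined with TF-IDF, then K-means) can be illustrated with a minimal sketch. The abstract does not say exactly how the two are combined, so the TF-IDF-weighted average used here is an assumption, as are the toy embeddings in `WORD_VECTORS`, the pre-tokenized English stand-in documents in `DOCS`, and all function names. Real microblog text would first need Chinese word segmentation and embeddings trained on a large corpus.

```python
import math
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy embeddings standing in for trained Word2vec vectors.
WORD_VECTORS = {
    "match": np.array([0.90, 0.10, 0.00]),
    "goal":  np.array([0.80, 0.20, 0.10]),
    "team":  np.array([0.85, 0.15, 0.05]),
    "movie": np.array([0.10, 0.90, 0.10]),
    "actor": np.array([0.00, 0.80, 0.20]),
    "film":  np.array([0.10, 0.85, 0.15]),
}

# Hypothetical pre-tokenized posts (two about sports, two about cinema).
DOCS = [
    ["match", "goal", "team"],
    ["team", "goal"],
    ["movie", "actor"],
    ["film", "actor", "movie"],
]


def idf_table(docs):
    """Smoothed inverse document frequency for every word in the corpus."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def doc_vector(doc, idf, vectors):
    """TF-IDF-weighted average of the word vectors of one document."""
    dim = len(next(iter(vectors.values())))
    tf = Counter(w for w in doc if w in vectors)
    if not tf:
        # No known word: fall back to the zero vector.
        return np.zeros(dim)
    weights = np.array([tf[w] / len(doc) * idf[w] for w in tf])
    stacked = np.stack([vectors[w] for w in tf])
    return weights @ stacked / weights.sum()


def cluster_topics(docs, vectors, n_topics, seed=0):
    """Turn each document into a vector and group the vectors with K-means."""
    idf = idf_table(docs)
    X = np.stack([doc_vector(d, idf, vectors) for d in docs])
    model = KMeans(n_clusters=n_topics, n_init=10, random_state=seed)
    return model.fit_predict(X)


labels = cluster_topics(DOCS, WORD_VECTORS, n_topics=2)
print(labels)  # cluster ids are arbitrary; only the grouping matters
```

Weighting by TF-IDF lets frequent but uninformative words contribute less to the text vector than a plain average would, which is one plausible reason to pair it with Word2vec.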
Doc2vec had not been applied to Chinese text sentiment classification in previous studies. Finally, a support vector machine (SVM) classifier performs the sentiment classification, and ten-fold cross-validation is used to evaluate the classification accuracy.

The clustering and classification experiments verify the strong performance of the methods based on word and document distributed representations: the accuracy reaches 80.06% and 90.35%, respectively. Compared with other text representation methods, they overcome shortcomings such as high-dimensional sparse matrices and the neglect of semantics, syntax, context, and emotional information. These methods convert texts to fixed-dimension vectors more precisely and efficiently, which facilitates further text mining research. The paper also reports empirical values obtained in the experiments, including the size of the training corpus and the feature dimension settings, which can serve as a reference for future research.
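
The evaluation protocol for the sentiment step (an SVM classifier scored with ten-fold cross-validation) can be sketched as follows. This is only an illustration under stated assumptions: the random vectors stand in for Doc2vec document vectors, and the linear kernel, the stratified shuffled folds, the 50-dimensional feature size, and the name `evaluate_svm` are choices made here, not details taken from the thesis.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def evaluate_svm(X, y, n_folds=10, seed=0):
    """Return mean accuracy and per-fold accuracies of an SVM under k-fold CV."""
    # Stratified, shuffled folds keep the class balance in every fold (assumption).
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # Linear kernel is an assumption; the abstract does not name the kernel.
    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores


# Synthetic stand-ins for fixed-length document vectors of labelled posts.
rng = np.random.default_rng(0)
dim = 50  # hypothetical feature dimension
positive = rng.normal(loc=0.5, scale=1.0, size=(100, dim))
negative = rng.normal(loc=-0.5, scale=1.0, size=(100, dim))
X = np.vstack([positive, negative])
y = np.array([1] * 100 + [0] * 100)  # 1 = positive polarity, 0 = negative

mean_acc, fold_scores = evaluate_svm(X, y)
print(f"mean accuracy over {len(fold_scores)} folds: {mean_acc:.4f}")
```

With real data, `X` would hold the inferred document vectors and `y` the manually assigned polarity labels. The accuracy printed for this synthetic data says nothing about the figures reported in the thesis.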
Keywords/Search Tags: distributed representation, microblog, topic detection, sentiment analysis