
Research Of Topic Detection For Social Media Based On Word Embedding Model

Posted on: 2017-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
GTID: 2348330503981839
Subject: Computer Science and Technology
Abstract/Summary:
The twenty-first century is an era of rapid development in network and information technology. In recent years, the spread of the mobile Internet and Web 2.0 applications has given rise to many social media platforms, such as microblogs, blogs, and forums, which make it increasingly convenient for ordinary people to express their views online. The large volume of online comments reflects the attitudes, opinions, and needs of the public over a period of time, so it is extremely important to capture, mine, and analyze what Internet users are discussing in a timely and accurate way. However, most current work on topic recognition for social media is based on surface attributes of the data: it treats the word as the basic feature and computes word probabilities from word frequency, and it usually ignores semantic information. In this thesis, we conduct our research on social media datasets and focus on detecting and analyzing the topics of their content using topic models. The main work consists of the following two parts:
(1) With respect to the characteristics of social media datasets, existing word vector models do not consider the internal order of words, and they use only the local context to predict the target word at each training step, which is insufficient to capture semantic knowledge. To overcome this problem, we propose a novel hybrid model called mixed word embedding (MWE), which considers both word order and mixed context information. 
This model is based on the well-known word2vec toolbox. It combines the two variants of word2vec, Skip-gram and CBOW, in a seamless way by sharing a common encoding structure, which allows it to capture the syntactic and semantic information of words more accurately. Furthermore, it incorporates both the local and the global context of the target word within a sliding window while maintaining the word order in each document. After training, we obtain word embeddings that carry rich syntactic and semantic information at the same time.
(2) Existing probabilistic topic models treat the word as the basic unit and compute the probability between words and topics from co-occurrence frequency, giving little consideration to semantic information. Social media, however, usually contains large numbers of short text messages with few useful word features and much noisy data, which makes it difficult to recognize and analyze topics directly in social media topic detection tasks. In this thesis, we introduce an external expansion corpus as auxiliary information for the LDA model to better capture words and their semantic expressions. We use the model proposed in (1) to obtain good word embeddings and then feed them into the topic model for topic detection and analysis by redefining the conditional probability distribution over topic vectors and word embeddings. We minimize the KL divergence between the new topic-word distribution and the original topic-word distribution of LDA in order to learn both the word embeddings and the topic model. The experimental results show that this method performs better on word representation and topic detection than the word2vec and LDA models.
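The KL-divergence objective mentioned in part (2) can be illustrated with a minimal sketch. The abstract does not give the exact form of the embedding-based topic-word distribution or the direction of the divergence, so the code below makes several assumptions: the new distribution is taken to be a softmax over dot products between a topic vector and the word embeddings, the divergence is computed as KL(LDA || embedding-based), and all names (`embedding_topic_word_dist`, `kl_divergence`) and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embedding_topic_word_dist(topic_vec, word_vecs):
    """Assumed form of the new topic-word distribution:
    softmax over dot products of one topic vector with each word embedding."""
    scores = [sum(t * w for t, w in zip(topic_vec, vec)) for vec in word_vecs]
    return softmax(scores)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy values (hypothetical): a three-word vocabulary and one topic.
lda_dist = [0.7, 0.2, 0.1]                      # topic-word distribution from LDA
word_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # word embeddings
topic_vec = [1.0, 0.2]                          # topic vector in embedding space

emb_dist = embedding_topic_word_dist(topic_vec, word_vecs)
kl = kl_divergence(lda_dist, emb_dist)
print(emb_dist, kl)
```

In a training loop, a quantity like `kl` would be minimized with respect to the topic vectors and word embeddings; the optimizer and the handling of multiple topics are left out of this sketch.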
Keywords/Search Tags:Social media, topic detection, feature expression, word embeddings, topic model