Hot Topic Extraction From Microblogs

Posted on: 2015-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhu
GTID: 2268330428980402
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, the volume of data and resources on the Internet has grown massively. To manage and use this information effectively, content-based information retrieval and data mining have become active research topics. Latent semantic analysis obtains latent topics by computing the topic distribution of each document and the word distribution of each topic from the known word distribution of the documents. It plays an important role in information retrieval and text mining and is applied extensively in text classification and clustering, information organization and management, and hot topic extraction.
In recent years, with the rise of Web 2.0, social networks such as Renren, Facebook, Twitter, and Sina Weibo have not only become very popular but have also become part of modern life. Widely used social media produce massive amounts of user-generated content (UGC), of which more than 80% is natural language text. Detecting hot topics helps people get essential information quickly. However, these texts have their own characteristics, and many traditional topic analysis models cannot achieve good results unless augmented with new features. Texts from social networks have four salient characteristics: they are high-dimensional, sparse, non-standard in language, and unevenly distributed across topics.
In other words, large numbers of messages are posted every minute, and these texts are likely to produce vectors with more than ten thousand dimensions, which makes topic extraction too time-consuming. Compared with long texts, these texts contain far fewer keywords, producing a sparse "document-word" matrix from which it is difficult to extract effective features and to exploit the correlations between features. Abbreviations and catchwords are used extensively in social networks, which increases the number of synonyms in the texts and makes topic identification harder. In addition, few messages on microblogs are valuable for hot topic detection, because the vast majority are about users' daily lives, such as weather, food, and emotions. So whether a term is hot is not determined by its frequency of occurrence. We aim to address these problems in hot topic detection on microblogs. On the one hand, traditional topic analysis methods cannot identify hot terms efficiently, resulting in lower accuracy; on the other hand, the large amount of text makes classification algorithms less efficient. Recently, many efforts have focused on improving the accuracy and efficiency of hot topic extraction. One approach is to use external knowledge to enrich semantic information. The external corpus is known to greatly influence the final result; however, selecting appropriate external knowledge is nontrivial. The other approach is to use the attributes of microblogs, such as posting time, tags, repost count, and comment count, to help maintain users' information. In this thesis, we investigate these problems on the basis of the LDA model. Our contributions are as follows:
1) To enrich the information in a document, we first combine similar messages from a user using an entity-based cosine algorithm.
Then hot topics are extracted by the LDA model based on two or more external datasets.
2) We propose multi-attribute Latent Dirichlet Allocation (MA-LDA), a topic analysis model in which the time and tag attributes of microblogs are incorporated into the LDA model. By introducing a time variable for the time attribute, the MA-LDA model can decide whether a word should appear in hot topics. Applying the tag attribute allows the MA-LDA model to rank core words high in the results, which improves the expressiveness of the outcomes.
The experimental results show that both 1) and 2) can significantly improve the performance of hot topic extraction in the context of microblogs.
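The message-combining step in contribution 1) might look roughly like the sketch below. The abstract does not give the details of the entity-based cosine algorithm, so everything here is an assumption for illustration: each message is represented by a list of already-extracted entity strings, the greedy grouping strategy and the names `cosine` and `merge_similar_messages` are hypothetical, and the threshold of 0.5 is arbitrary. The merged groups would then serve as longer pseudo-documents for a topic model such as LDA.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse entity-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar_messages(messages, threshold=0.5):
    """Greedily group one user's messages by entity overlap.

    `messages` is a list of entity lists (hypothetical input format).
    A message joins the first group whose accumulated entity vector has
    cosine similarity >= `threshold` (an assumed value); otherwise it
    starts a new group. Returns the message indices in each group.
    """
    groups = []  # each item: (accumulated Counter, list of message indices)
    for idx, entities in enumerate(messages):
        vec = Counter(entities)
        for group_vec, members in groups:
            if cosine(vec, group_vec) >= threshold:
                group_vec.update(vec)  # fold the message into the group
                members.append(idx)
                break
        else:
            groups.append((vec, [idx]))
    return [members for _, members in groups]


# Made-up example data: four short messages from one user.
user_messages = [
    ["earthquake", "sichuan"],
    ["rain", "umbrella"],
    ["earthquake", "rescue", "sichuan"],
    ["umbrella", "rain", "rain"],
]
merged = merge_similar_messages(user_messages)
print(merged)  # [[0, 2], [1, 3]]
```

Grouping on entities rather than on all words is one way to keep everyday vocabulary from dominating the similarity score; whether the thesis does this in the same way is not stated in the abstract.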
Keywords/Search Tags: hot topic extraction, topic model, LDA, latent semantic analysis, classification