Hot Topic Extraction From Microblogs

Posted on:2015-01-31

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhu

Full Text:PDF

GTID:2268330428980402

Subject:Computer application technology

Abstract/Summary:

With the development of information technology, the volume of data and resources on the Internet has grown massively. To manage and use this information effectively, content-based information retrieval and data mining have become active research topics. Latent semantic analysis infers latent topics by computing the topic distribution of each document and the word distribution of each topic from the observed word distribution of the documents. It plays an important role in information retrieval and text mining and is widely applied in text classification and clustering, information organization and management, and hot topic extraction.

In recent years, with the rise of Web 2.0, social networks such as Renren, Facebook, Twitter, and Sina Weibo have not only become very popular but have also become part of modern life. Widely used social media produce massive amounts of user-generated content (UGC), of which more than 80% is natural language text. Detecting hot topics helps people get essential information quickly. However, these texts have their own characteristics, and many traditional topic analysis models cannot achieve good results on them unless augmented with new features. Texts from social networks have four salient characteristics: they are high-dimensional, sparse, non-standard in language, and unevenly distributed across topics.
In other words, large numbers of messages are posted every minute, and these texts are likely to produce vectors with more than ten thousand dimensions, which makes topic extraction too time-consuming. Compared with long texts, these texts contain far fewer keywords, which yields a sparse "document-word" matrix and makes it difficult to extract effective features and to exploit the correlations between them. Abbreviations and catchwords are used extensively in social networks, which increases the number of synonyms in the texts and makes topic identification harder. In addition, few messages on microblogs are valuable for hot topic detection, because the vast majority are about users' daily lives, such as weather, food, and emotions. Consequently, whether a term is hot cannot be determined from its frequency of occurrence alone.

We aim to address the problems of hot topic detection on microblogs. On the one hand, traditional topic analysis methods cannot identify hot terms effectively, resulting in lower accuracy; on the other hand, the large amount of text makes the classification algorithm less efficient. Recent efforts to improve the accuracy and efficiency of hot topic extraction follow two directions. One is to use external knowledge to enrich semantic information. The external corpus is known to greatly influence the final result, but selecting appropriate external knowledge is nontrivial. The other is to use the attributes of microblogs, such as posting time, tags, repost count, and comment count, to help maintain users' information. In this thesis, we investigate these problems on the basis of the LDA model. Our contributions are as follows:

1) To enrich the information in a document, we first combine similar messages from the same user using an entity-based cosine algorithm.
Then hot topics are extracted by the LDA model based on two or more external datasets.

2) We propose multi-attribute Latent Dirichlet Allocation (MA-LDA), a topic analysis model in which the time and tag attributes of microblogs are incorporated into the LDA model. By introducing a time variable for the time attribute, the MA-LDA model can decide whether a word should appear in hot topics. Applying the tag attribute allows the MA-LDA model to rank core words high in the results, which improves the expressiveness of the output.

The experimental results show that both 1) and 2) can significantly improve the performance of hot topic extraction in the context of microblogs.
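
As a rough illustration of the message-merging step in contribution 1), the sketch below groups one user's messages whose entity vectors have a high cosine similarity, so that each group can be treated as a single longer document before topic modeling. The abstract does not say how entities are extracted, what similarity threshold is used, or how groups are formed. The pre-extracted entity lists, the greedy grouping strategy, the `threshold` value, and the names `cosine` and `merge_similar_messages` are therefore all assumptions made for illustration, not the thesis's implementation.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse entity-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def merge_similar_messages(messages, threshold=0.5):
    """Greedily group one user's messages by entity similarity.

    `messages` is a list of entity lists, one per message (assumed to be
    extracted beforehand). A message joins the first existing group whose
    pooled entity vector has cosine similarity >= threshold (hypothetical
    value); otherwise it starts a new group.
    Returns a list of groups, each a list of message indices.
    """
    groups = []  # each entry: (pooled entity Counter, member indices)
    for idx, entities in enumerate(messages):
        vec = Counter(entities)
        for pooled, members in groups:
            if cosine(vec, pooled) >= threshold:
                pooled.update(vec)  # add this message's entity counts
                members.append(idx)
                break
        else:
            groups.append((vec, [idx]))
    return [members for _, members in groups]


# Made-up example: four short messages from one user.
example = [
    ["earthquake", "sichuan"],
    ["sichuan", "earthquake", "rescue"],
    ["lunch", "noodles"],
    ["noodles"],
]
print(merge_similar_messages(example))  # -> [[0, 1], [2, 3]]
```

Pooling the entity counts of merged messages gives each group a denser vector than any single short message, which is one simple way to reduce the sparsity of the "document-word" matrix described above. A greedy single pass is used here only for brevity; other clustering strategies would fit the same description.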

Keywords/Search Tags:

hot topic extraction, topic model, LDA, latent semantic analysis, classification

Related items

1	Research On Topic Modeling Method Based On Semantic Distribution Similarity
2	Topic Discovery And Trend Analysis In Scientific Literature Based On Topic Model
3	Topic Analysis And Recommendation System Based On Scientific Research Documents
4	Research On Topic Discovery Method For Social Network
5	Research And Application Of Text Classification Model Based On Topic Model
6	Research On Rough Classification Of Academic Papers Based On Topic And Semantic Fingerprint Fusion
7	Topic Extraction Algorithm Based On NP-Chunking And Phrase Weight Calculation
8	Analysis Model Of Medical Text And Image Based On LDA And LSA And Its Application
9	Research And Application Of Topic Evolution Model Based On LDA
10	Hot Topic Detection Strategy Of Micro-blog Based On Latent Semantic Analysis