Research On Topic Modeling Over Heterogeneous Information Networks

Posted on: 2015-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wang
Full Text: PDF
GTID: 2268330431956294
Subject: Computer software and theory
Abstract/Summary:
With the development of Web applications, vast numbers of objects interact with one another, forming large, interconnected, and complex networks. We call such networks information networks. Information networks are ubiquitous and have become an important part of modern information infrastructure. To better understand their characteristics, information network analysis techniques have attracted considerable attention and are now widely applied in data mining and data analysis.

When an information network contains a single type of object, we call it a homogeneous information network. For example, in the DBLP co-author network, the nodes represent only authors and the links represent only relationships between authors. When an information network contains multiple types of objects or links, it is defined as a heterogeneous information network. For example, in the DBLP bibliographic network, there are three types of objects (papers, authors, and venues) and two types of relationships (paper-author and paper-venue). Many influential algorithms and applications have come out of homogeneous information network analysis, such as PageRank, HITS, and community discovery. However, most real-world networks are heterogeneous, and the complex links between different types of objects reveal more important semantic information. Research on heterogeneous information networks has therefore become a hot spot in data mining.

Topic modeling is an important document analysis technique that finds the latent topics hidden in documents, and it has been widely applied in machine learning, natural language processing, and other fields. In recent years, textual documents such as web pages, papers, and blogs have become richer and are ubiquitously interconnected with each other and with other objects (e.g., users), forming various heterogeneous information networks.
In heterogeneous information networks, the links among objects carry rich network semantics, and the objects themselves contain rich text content, so it is necessary to study topic modeling over heterogeneous information networks. Most topic models consider only homogeneous information networks, and research on topic modeling over heterogeneous information networks is rare.

In this thesis, we study the problem of topic modeling on heterogeneous information networks. First, we propose LSA-PTM, a propagation-based topic model using latent semantic analysis, which integrates the heterogeneous information network and textual documents into topic modeling. Building on LSA-PTM, we further consider the intrinsic topic consistency between heterogeneous information networks and textual documents and put forward an optimized topic model, cluTM. The main contributions of this thesis are as follows:

1. We propose LSA-PTM, a propagation-based topic model using latent semantic analysis on heterogeneous information networks. We introduce a topic propagation method, based on the links between different objects, that integrates heterogeneous information networks into topic modeling and enhances the topic modeling results. To better convey the meaning of each topic, a topic description is computed for each topic. Experiments on a real DBLP dataset show that LSA-PTM outperforms other topic models.

2. We propose cluTM, a unified topic model integrating content and links. cluTM directly combines the content of textual documents and the links in a unified framework through joint matrix factorization of both the document-phrase matrix and the link matrices using latent semantic analysis. Further analysis is performed on the resulting compact representation, which can discover the latent topics and identify clusters of multi-typed objects simultaneously. We apply cluTM to the DBLP dataset, and the experimental results show that cluTM is more effective than LSA-PTM.
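
The abstract does not give the formulas behind LSA-PTM, so the following is only a minimal sketch of the general idea it describes: compute LSA coordinates for the papers, then propagate them along paper-author and paper-venue links. The toy matrices, the function names (`lsa_embed`, `row_normalize`, `propagate`), the averaging rule, and the parameters `alpha` and `steps` are all assumptions made for illustration, not the thesis's actual method.

```python
import numpy as np

# Toy DBLP-style data; every value here is made up for illustration.
# Rows are 4 papers, columns are 5 terms.
doc_term = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Link matrices: entry [i, j] is 1 if paper i is linked to object j.
paper_author = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
paper_venue = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)


def lsa_embed(matrix, k):
    """Return rank-k LSA coordinates of the rows (U_k scaled by S_k)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


def row_normalize(matrix):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    sums = matrix.sum(axis=1, keepdims=True)
    return np.divide(matrix, sums, out=np.zeros_like(matrix), where=sums > 0)


def propagate(paper_topics, link_matrices, alpha=0.5, steps=5):
    """Smooth paper topic vectors through linked objects (assumed rule).

    Each linked object (author, venue) takes the mean vector of its papers,
    and each paper is then pulled toward the mean of its linked objects.
    alpha controls how much of the original text-based vector is kept.
    """
    current = paper_topics.copy()
    for _ in range(steps):
        neighbor_sum = np.zeros_like(current)
        for links in link_matrices:
            object_topics = row_normalize(links.T) @ current
            neighbor_sum += row_normalize(links) @ object_topics
        neighbor_mean = neighbor_sum / len(link_matrices)
        current = alpha * paper_topics + (1 - alpha) * neighbor_mean
    return current


paper_topics = lsa_embed(doc_term, k=2)
smoothed = propagate(paper_topics, [paper_author, paper_venue])
# Topic vectors for authors, derived from their papers' smoothed vectors.
author_topics = row_normalize(paper_author.T) @ smoothed
```

In this toy network, papers 0 and 1 share an author and a venue, so propagation draws their topic vectors closer together than the text alone would. A joint factorization in the spirit of cluTM would instead factor the document-phrase matrix and the link matrices together, which this sketch does not attempt.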
Keywords/Search Tags: Topic Modeling, Heterogeneous Information Network, Latent Semantic Analysis