
Topic models for link prediction in document networks

Posted on: 2013-03-18
Degree: Ph.D
Type: Thesis
University: The Pennsylvania State University
Candidate: Kataria, Saurabh
Full Text: PDF
GTID: 2458390008468803
Subject: Information Technology
Abstract/Summary:
The recent explosive growth of interconnected document collections, such as citation networks, networks of web pages, and content generated by crowd-sourcing in collaborative environments, has posed several challenging problems for the data mining and machine learning communities. One central problem in the domain of document networks is link prediction: predicting links between any two documents or document-centric entities, such as authors, based on the links already present in a given network. The problem is fundamental: several applications can be cast as a particular flavour of it, including recovering missing links among entities in a given network of documents, recommending citations to research professionals, recommending collaborators to authors, discovering influential authors or bloggers in research articles or web-logs respectively, studying the propagation of ideas and opinions in evolving collections of research documents or news media, and disambiguating references to people mentioned in news articles. This thesis studies the following three link-prediction-based research problems in document networks: (i) who influences others' actions in a collaborative research environment? (ii) which documents get cited by a document that joins a citation network? and (iii) which is the correct entity for an entity mention in free text?

Among the various computational methods for solving domain-specific link prediction problems, techniques based on statistical machine learning are increasingly accepted, owing to their capability to model complex relationships among documents and document-centric entities and to dedicated efforts by the research community to make the resulting intractable inference computationally scalable. This thesis proposes two types of statistical models: (1) models that mimic the generation process of document networks, e.g., citation networks of scientific documents, interconnected blog articles, and web pages; and (2) models that are capable of incorporating task-specific features as supervision. The proposed statistical models are extensions of Latent Dirichlet Allocation, a widely used topic model. In this work, I show how topic models can be adapted for the above-mentioned link prediction problems. The proposed techniques outperform previous approaches on these link prediction problems.
Keywords/Search Tags:Link prediction, Document, Networks, Topic models, Problem, Citation