Term Relatedness from Wiki-Based Resources Using Sourced PageRank

Posted on: 2011-01-21
Degree: Ph.D.
Type: Dissertation
University: The Ohio State University
Candidate: Weale, Timothy Fitzgerald
Full Text: PDF
GTID: 1448390002957420
Subject: Engineering
Abstract/Summary:
This dissertation concerns itself with creating a new algorithm for automatically measuring the amount of relatedness between a given pair of terms. Research into term relatedness is important because it has been empirically demonstrated that using relatedness metrics can improve the performance of tasks in Natural Language Processing and Information Retrieval by expanding the usable vocabulary. Previous relatedness metrics have used a variety of sources of semantic data to judge term relatedness, including text corpora, expertly-constructed resources and, most recently, Wikipedia and Wiktionary. The primary focus of this dissertation is the creation of a new metric for deriving term relatedness from the graph structure of Wikipedia and Wiktionary using Sourced PageRank, a modified version of the PageRank algorithm, to generate the relatedness values.

This new algorithm is compared to several existing relatedness metrics in two established task domains. The first domain measures the metric's ability to replicate human-generated relatedness values for term pairs. The second domain tests a metric's ability to select the synonym of a given term from a list of possible candidates. In both of these experiments, the Sourced PageRank-based term relatedness algorithm that uses Wiktionary as its source of semantic data is able to compete with or exceed the performance of existing state-of-the-art algorithms in these task domains.

Additionally, the different emphases of Wikipedia and Wiktionary are covered as part of this dissertation. This is an area that has not been emphasized in past work with Wiki-based relatedness metrics. We find that Wikipedia is a source of information on proper names and their real-world referents, including corporations, events and people. Wiktionary has more information on common words that almost everyone knows. Each Wiki-based resource has its own strength and must be matched with the needs of the task in order to yield maximum benefits.

Finally, we explore how to use additional information found in Wikipedia and Wiktionary as metadata for graph manipulation. While we achieve mixed results, the investigation opens another area of research for graph-based relatedness metrics that use Wikipedia or Wiktionary as the source of semantic data.
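
The abstract does not spell out how Sourced PageRank modifies PageRank, so the following is only an illustrative sketch, not the dissertation's method. It assumes a standard personalized PageRank (a random walk that always restarts at the source term) as a stand-in, scores a pair of terms by averaging the rank each assigns to the other, and then picks a synonym from a candidate list as in the second task domain. The function names, the damping value, the symmetric averaging and the toy link graph are all hypothetical.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """Power iteration for PageRank with every restart returning to `source`.

    ASSUMPTION: this is ordinary personalized PageRank, used here as a
    stand-in; the actual Sourced PageRank definition is not in the abstract.
    `graph` maps each node to a list of nodes it links to.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes.add(source)

    # Start with all probability mass on the source term.
    rank = {node: 0.0 for node in nodes}
    rank[source] = 1.0

    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        for node in nodes:
            out_links = graph.get(node, [])
            if out_links:
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    nxt[target] += share
            else:
                # Dangling node: return its walking mass to the source.
                nxt[source] += damping * rank[node]
        # Restart mass; total rank stays 1, so this is exactly (1 - damping).
        nxt[source] += 1.0 - damping
        rank = nxt
    return rank


def relatedness(graph, term_a, term_b):
    """Hypothetical symmetric score: mean rank each term gives the other."""
    a_to_b = personalized_pagerank(graph, term_a).get(term_b, 0.0)
    b_to_a = personalized_pagerank(graph, term_b).get(term_a, 0.0)
    return (a_to_b + b_to_a) / 2.0


def select_synonym(graph, term, candidates):
    """Pick the candidate with the highest relatedness to `term`."""
    return max(candidates, key=lambda cand: relatedness(graph, term, cand))


# Toy link graph (made up for illustration; not real Wiktionary data).
toy_graph = {
    "car": ["automobile", "vehicle"],
    "automobile": ["car", "vehicle"],
    "vehicle": ["car"],
    "banana": ["fruit"],
    "fruit": ["banana"],
}

best = select_synonym(toy_graph, "car", ["automobile", "banana", "fruit"])
print(best)
```

In this toy graph "banana" and "fruit" are unreachable from "car", so they score zero and "automobile" is selected. On a real Wikipedia or Wiktionary link graph the scores would depend on how the graph is built and on the actual source-handling rule, neither of which the abstract specifies.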
Keywords/Search Tags: Relatedness, Source, Semantic data, Wiktionary, Wikipedia, Wiki-based, Using, Algorithm