
Lydia: A system for the large scale analysis of natural language text

Posted on: 2007-06-21
Degree: Ph.D.
Type: Dissertation
University: State University of New York at Stony Brook
Candidate: Lloyd, Levon
GTID: 1448390005973366
Subject: Computer Science
Abstract/Summary:
The increasing volume of informative content on the World-Wide Web, coupled with decreasing costs of computation and communication, has created exciting new opportunities in text mining. Toward this end, we have started the Lydia project, a natural language system for rapidly assimilating the primary vocabulary associated with uniform sources of text and extracting relations between them.

Our system addresses two main issues: (1) the development of a relatively domain-independent, large-scale system for text analysis, and (2) exploring the applications of a database of information extracted from a large set of documents.

In order to keep up with the daily flow of dynamic sources (e.g. news, blogs), our system needs to be efficient. Chapter 3 gives a detailed description of the pipeline architecture we chose for Lydia and how it was designed to meet this requirement. It is capable of retrieving a daily newspaper like The New York Times and analyzing the resulting stream of text in roughly one minute of computer time.

Periodical publications represent a rich and recurrent source of knowledge on both current and historical events. Chapter 4 gives the results of running Lydia on a corpus of newspaper text. We explain how we acquire our data, then introduce two applications built on top of Lydia: heatmaps, our system for visualizing the geographic buzz surrounding a topic, and juxtaposition analysis, our way of computing the significance of co-occurrence relationships.

A single logical entity can be referred to by several different names over a large text corpus. Chapter 5 presents our algorithm for finding all such co-reference sets in a large corpus. Our algorithm involves three steps: morphological similarity, contextual similarity, and clustering.
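The three-step co-reference procedure described above can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation: the edit-distance-based morphological measure, the Jaccard contextual measure, the greedy single-link clustering, and both thresholds are assumptions chosen for clarity.

```python
from difflib import SequenceMatcher

def morphological_sim(a, b):
    # Surface-string similarity between two entity names (illustrative measure).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def contextual_sim(ctx_a, ctx_b):
    # Jaccard overlap of the words observed near each name (illustrative measure).
    sa, sb = set(ctx_a), set(ctx_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def coreference_sets(names, contexts, m_thresh=0.8, c_thresh=0.3):
    # Greedy single-link clustering: merge two clusters whenever some pair
    # of names passes both the morphological and contextual thresholds.
    clusters = [{n} for n in names]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(morphological_sim(a, b) >= m_thresh and
                       contextual_sim(contexts[a], contexts[b]) >= c_thresh
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

For example, "George Bush" and "George W. Bush" appearing in similar contexts would be merged into one co-reference set, while a morphologically distant name such as "Al Gore" would remain in its own set.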
Finally, we present experimental results on a large corpus of real news text to verify our techniques.

In Chapter 6 we study the problem of disambiguating references to named people in web data. Each name spotted online is shared by several hundred people on average, and teasing apart these references is critical for a new family of person-aware analytical applications. We present and evaluate algorithms for this problem, and give results indicating that 25% of personal references can be successfully disambiguated with precision in excess of 95%, but that larger fractions cause a significant decline in precision.

Blogs and formal news sources both monitor the events of the day, but with substantially different frames of reference. In Chapter 7, we report on experiments comparing over 500,000 blog postings with the contents of 66 daily newspapers over the same six-week period. We compare the prevalence of popular topics in the blogspace and the news, and in particular analyze lead/lag relationships in the frequency time series of 197 entities in the two corpora. The correlation between news and blog references proved substantially higher when adjusting for lead/lag shifts, although the direction of these shifts varied across entities.

The thousands of specialized structured file formats in use today present a substantial barrier to freely exchanging information between application programs. In Chapter 8, we consider the problem of deducing such basic features as the whitespace characters, bracketing delimiter symbols, and self-delimiter characters of a given file format from one or more example files. We demonstrate that for sufficiently large example files, we can typically identify the basic features of interest.
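The format-deduction idea can be illustrated with a minimal frequency-based sketch: rank each non-alphanumeric character by how often it appears and by how often it sits between alphanumeric runs, so that frequent separators surface as delimiter candidates. The scoring rule and the `min_freq` threshold here are assumptions for illustration; the dissertation's actual inference procedure is more involved.

```python
from collections import Counter

def candidate_delimiters(text, min_freq=0.01):
    # Count every non-alphanumeric character in the example file.
    counts = Counter(c for c in text if not c.isalnum())
    total = len(text)
    scores = {}
    for ch, n in counts.items():
        if n / total < min_freq:
            continue  # too rare to be a structural character
        # Fraction of this character's occurrences flanked on both
        # sides by alphanumeric characters, i.e. acting as a separator.
        flanked = sum(
            1 for i, c in enumerate(text)
            if c == ch
            and 0 < i < len(text) - 1
            and text[i - 1].isalnum() and text[i + 1].isalnum()
        )
        scores[ch] = flanked / n
    # Highest-scoring characters are the strongest delimiter candidates.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

sample = "name,age,city\nalice,30,nyc\nbob,25,sf\n"
print(candidate_delimiters(sample))  # ',' ranks highest
```

On the CSV-like sample above, the comma scores 1.0 (every occurrence separates two alphanumeric runs), with the newline close behind, matching the intuition that both are structural characters of the format.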
Keywords/Search Tags: Large, Text, System, Lydia