News analysis for the social sciences

Posted on: 2010-06-05
Degree: Ph.D.
Type: Dissertation
University: State University of New York at Stony Brook
Candidate: Bautin, Mikhail
Full Text: PDF
GTID: 1448390002973519
Subject: Computer Science
Abstract/Summary:
The Lydia system analyzes spatial, temporal, and linguistic statistics of named entity occurrences in news text. As a result, it provides the user with a bird's-eye view of the media coverage of a specific entity. This system can be a valuable tool for social sciences such as political science, history, and economics. Our time series data on how an entity's associations with other entities and traits evolve is helpful in testing hypotheses in political science and history. Our maps showing the spatial distribution across the country of entity popularity, sentiment polarity, and subjectivity towards that entity are useful in reasoning about the level of success of a political campaign. Sentiment time series can be incorporated into studies of media influence and used to study voter behavior in response to local or national media coverage.

Processing large-volume historical news datasets is computationally intensive. To avoid the bottleneck of a single relational database (which our legacy Lydia system had) and to dramatically increase the scale of our news analysis, we have designed a new data aggregation and processing architecture. It encompasses text and derived statistics processing phases, an on-demand data retrieval server, a user interface web application, and experimental components aimed at validation and improvement of individual analysis phases.

This new Lydia architecture (code-named "Freedonia"), which is based on the Hadoop open-source map-reduce framework, is fully scalable and capable of analyzing statistics on over 74 million entities in more than 100 million U.S. daily news articles in a few hours on an 18-node cluster. It provides a 10-20x performance improvement over the old Lydia system, even on datasets that the old system can still handle.

The new Lydia architecture contains scalable versions of duplicate article detection, entity sentiment time series calculation, cross-document co-referential entity name identification, and statistics aggregation across groups of co-referential entities and other types of groups. The data is accessible to social scientists through a web interface, and advanced users can access our data services programmatically through an API. The new Lydia system was used to provide media coverage data of the 2008 presidential election for the National Annenberg Election Survey, the largest academic public opinion survey conducted during presidential elections. With the help of the new Lydia system, we study the differences in coverage of various cultural/ethnic/linguistic groups in U.S. news. In particular, we examine the frequency and sentiment volume time series of persons of various ethnicities and explore the geographic biases of ethnic group coverage. Finally, we include two studies aimed at the validation and improvement of different aspects of our named entity-centered text analysis system.

First, we generalize the Lydia sentiment analysis approach to languages other than English. We utilize state-of-the-art machine translation technology and perform sentiment analysis on the English translation of foreign language text. Our experiments indicate that entity sentiment scores obtained by our method are significantly correlated across nine languages of news sources and five languages of a parallel corpus and can be used to perform meaningful cross-cultural comparisons.

Second, we consider the problem of finding the relevant named entities in a text corpus in response to a free-text search query. We analyze the AOL search query logs to assess the significance of this problem. Then, we describe and evaluate our implementation of a concordance-based entity search engine that retrieves entity results based on entity occurrence contexts in the corpus.
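
As an illustration of the map-reduce style of aggregation behind an entity sentiment time series, the following is a minimal sketch in plain Python. It is not the Lydia or Freedonia code and does not use the Hadoop API; the record layout, the entity names, the polarity values, and the function names (`map_phase`, `reduce_phase`, `entity_time_series`) are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical input: one record per sentiment-bearing entity mention,
# as (article date, entity name, polarity), with polarity +1 or -1.
MENTIONS = [
    ("2008-10-01", "Entity A", +1),
    ("2008-10-01", "Entity A", +1),
    ("2008-10-01", "Entity B", -1),
    ("2008-10-02", "Entity A", -1),
    ("2008-10-02", "Entity B", +1),
    ("2008-10-02", "Entity B", +1),
]


def map_phase(records):
    """Emit ((entity, date), (mention count, polarity)) for each mention."""
    for date, entity, polarity in records:
        yield (entity, date), (1, polarity)


def reduce_phase(pairs):
    """Group emitted pairs by key and sum the counts and polarities.

    Sorting stands in for the shuffle step that a map-reduce
    framework would perform between the map and reduce phases.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        mentions = 0
        net_sentiment = 0
        for _, (count, polarity) in group:
            mentions += count
            net_sentiment += polarity
        yield key, (mentions, net_sentiment)


def entity_time_series(records):
    """Build {entity: {date: {"mentions": n, "net_sentiment": s}}}."""
    series = defaultdict(dict)
    for (entity, date), (mentions, net) in reduce_phase(map_phase(records)):
        series[entity][date] = {"mentions": mentions, "net_sentiment": net}
    return dict(series)


if __name__ == "__main__":
    for entity, by_date in entity_time_series(MENTIONS).items():
        for date, stats in sorted(by_date.items()):
            print(entity, date, stats)
```

Keying each emitted pair on (entity, date) means that every reducer handles one entity-day independently, which is what lets this kind of aggregation spread across a cluster instead of funnelling through a single database.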
Keywords/Search Tags:News, Entity, Lydia system, Text, Time series, Social