
Generating natural language summaries from multiple on-line sources: Language reuse and regeneration

Posted on: 2000-11-14
Degree: Ph.D.
Type: Thesis
University: Columbia University
Candidate: Radev, Dragomir Radkov
Full Text: PDF
GTID: 2468390014463580
Subject: Computer Science
Abstract/Summary:
The abundance of newswire on the World-Wide Web has resulted in at least four major problems, which seem to present the most interesting challenges to users and researchers alike: size, heterogeneity, change, and conflicting information.

Size: several hundred newspapers and news agencies maintain their Web sites with thousands of news stories in each.

Heterogeneity: some of the data related to news is in structured format (e.g., tables); more exists in semi-structured format (e.g., Web pages, encyclopedias, textual databases); while the rest of the data is in textual form (e.g., newswire).

Change: most Web sites and certainly all news sources change on a daily basis.

Disagreement: different sources present conflicting or at least different views of the same event.

We have approached the second, third, and fourth of these four problems from the point of view of text generation. We have developed a system, SUMMONS, which, when coupled with appropriate information extraction technology, generates a specific genre of natural language summaries of a particular event (which we call briefings) in a restricted domain. The briefings are concise, contain facts from multiple and heterogeneous sources, and incorporate evolving information, highlighting agreements and contradictions among sources on the same topic.

We have developed novel techniques and algorithms for combining data from multiple sources at the conceptual level (using natural language understanding), for identifying new information on a given topic, and for presenting the information in natural language form to the user. We named the framework that we have developed for these problems language reuse and regeneration (LRR). Its novelty lies in the ability to produce text by collating text already written by humans on the Web.

The main features of LRR are: increased robustness through a simplified parsing/generation component, leverage on text already written by humans, and facilities for the inclusion of structured data in computer-generated text.

The present thesis contains an introduction to LRR and its use in multi-document summarization. We have paid special attention to the techniques for producing conceptual summaries of multiple sources, to the creation and use of an LRR-based lexicon for text generation, to a methodology used to identify new and old information in threads of documents, and to the generation of fluent natural language text using all the components above.

The thesis contains evaluations of the different components of SUMMONS as well as certain aspects of LRR as a methodology. A review of the relevant literature is included as a separate chapter.
Keywords/Search Tags: Natural language, Sources, LRR, Multiple, Web, Summaries, Generation, News