Generating natural language summaries from multiple on-line sources: Language reuse and regeneration

Posted on:2000-11-14

Degree:Ph.D

Type:Thesis

University:Columbia University

Candidate:Radev, Dragomir Radkov

Full Text:PDF

GTID:2468390014463580

Subject:Computer Science

Abstract/Summary:

The abundance of newswire on the World-Wide Web has resulted in at least four major problems, which seem to present the most interesting challenges to users and researchers alike: size, heterogeneity, change, and conflicting information.

Size: several hundred newspapers and news agencies maintain Web sites, each with thousands of news stories.

Heterogeneity: some of the data related to news is in structured format (e.g., tables); more exists in semi-structured format (e.g., Web pages, encyclopedias, textual databases); the rest is in textual form (e.g., newswire).

Change: most Web sites, and certainly all news sources, change on a daily basis.

Disagreement: different sources present conflicting, or at least different, views of the same event.

We have approached the second, third, and fourth of these four problems from the point of view of text generation. We have developed a system, SUMMONS, which, when coupled with appropriate information extraction technology, generates a specific genre of natural language summaries of a particular event (which we call briefings) in a restricted domain. The briefings are concise, contain facts from multiple heterogeneous sources, and incorporate evolving information, highlighting agreements and contradictions among sources on the same topic.

We have developed novel techniques and algorithms for combining data from multiple sources at the conceptual level (using natural language understanding), for identifying new information on a given topic, and for presenting the information to the user in natural language form. We named the framework that we have developed for these problems language reuse and regeneration (LRR). Its novelty lies in the ability to produce text by collating text already written by humans on the Web.

The main features of LRR are: increased robustness through a simplified parsing/generation component, leveraging of text already written by humans, and facilities for the inclusion of structured data in computer-generated text.

The present thesis contains an introduction to LRR and its use in multi-document summarization. We have paid special attention to the techniques for producing conceptual summaries of multiple sources, to the creation and use of an LRR-based lexicon for text generation, to a methodology for identifying new and old information in threads of documents, and to the generation of fluent natural language text using all the components above.

The thesis contains evaluations of the different components of SUMMONS as well as of certain aspects of LRR as a methodology. A review of the relevant literature is included as a separate chapter.

Keywords/Search Tags:

Natural language, Sources, LRR, Multiple, Web, Summaries, Generation, News

Related items

1	Deep Learning Natural Language Generation System For Scientific Literature Based On Microservices
2	Traditional News Sources and Mobile Media: Will the Millennial Generation's Use of Alternative News Sources Change How Journalism Is Taught in Higher Education
3	Research On Non-Autoregressive Models In Natural Language Generation
4	Research On Natural Language Generation In Task-based Dialogue System
5	Research On Natural Language Generation Techniques In The Large Language Model Era Of Deep Learning
6	Research On Natural Language Generation Technology For Electronic Commerce
7	Research On The Method Of Generating News Text Summaries Fused With Keywords
8	A Study Of Natural Language Use Analysis For News And Fiction
9	Research On Robustness Methods In Natural Language Understanding And Generation
10	Multi-agent Parallel Study Of Natural Language Communication Capabilities In The Design Environment