
Automatic extractive summarization on meeting corpus

Posted on: 2011-07-31
Degree: Ph.D.
Type: Thesis
University: The University of Texas at Dallas
Candidate: Xie, Shasha
Full Text: PDF
GTID: 2448390002461040
Subject: Computer Science
Abstract/Summary:
With massive amounts of speech recordings available, an important problem is how to process these data efficiently to meet users' needs. Automatic summarization is a useful technique that helps users browse large amounts of data. This thesis focuses on automatic extractive summarization of meeting corpora. We propose improved methods that address several issues in existing text summarization approaches, and we leverage speech-specific information for meeting summarization.

First, we investigate unsupervised approaches. Two unsupervised frameworks are used in this thesis: Maximum Marginal Relevance (MMR) and a concept-based global optimization approach. Under the MMR framework, we evaluate different similarity measures in order to better capture semantic-level information. For the concept-based method, we propose incorporating sentence importance weights so that the extracted summary covers both important concepts and important sentences.

Second, we treat extractive summarization as a binary classification problem and adopt supervised learning methods. In this approach, each sentence is represented by a rich set of features, and a positive or negative label is assigned to indicate whether the sentence belongs in the summary. We evaluate the contribution of different features to meeting summarization using forward feature selection. To address the imbalanced-data problem and human annotation disagreement, we propose using various sampling techniques and a regression model for the extractive summarization task.

Third, we focus on speech-specific information for improving meeting summarization performance. In supervised learning, we incorporate acoustic/prosodic features. Since the prosodic and textual features can be naturally split into two conditionally independent subsets, we investigate using the co-training algorithm to improve classification accuracy by leveraging information from unlabeled data. When summarizing ASR output, results are often worse than when using human transcripts because of the high word error rate in meeting transcripts. We therefore propose using rich speech recognition results, namely n-best hypotheses and confusion networks, to improve summarization performance on the ASR condition. All of the proposed methods yield significant improvements over existing approaches.
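
The MMR framework from the unsupervised part can be illustrated with a short sketch. The snippet below is a minimal illustration, not the thesis implementation: it assumes cosine similarity over TF-IDF vectors as the similarity measure (the thesis compares several measures) and greedily selects sentences that are relevant to the whole document yet not redundant with the summary built so far. The function name and parameters are hypothetical.

    # Minimal MMR sketch (assumed setup: TF-IDF sentence vectors, cosine similarity).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def mmr_summarize(sentences, summary_size=5, lam=0.7):
        # Represent each sentence as a TF-IDF vector; the document vector is the
        # centroid of all sentence vectors.
        vectors = TfidfVectorizer().fit_transform(sentences)
        doc_vector = np.asarray(vectors.mean(axis=0))
        relevance = cosine_similarity(vectors, doc_vector).ravel()
        pairwise = cosine_similarity(vectors)

        selected, candidates = [], list(range(len(sentences)))
        while candidates and len(selected) < summary_size:
            def mmr_score(i):
                # Trade off relevance to the document against redundancy with
                # sentences already in the summary.
                redundancy = max((pairwise[i][j] for j in selected), default=0.0)
                return lam * relevance[i] - (1 - lam) * redundancy
            best = max(candidates, key=mmr_score)
            selected.append(best)
            candidates.remove(best)
        return [sentences[i] for i in sorted(selected)]

The parameter lam controls the trade-off: values near 1 favor relevance to the document, values near 0 favor low redundancy among the selected sentences.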
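
The co-training setup described in the third part can likewise be sketched, assuming the textual and prosodic features form two separate views of the same sentences. The classifier choice (logistic regression), the confidence criterion, and the growth size per round below are illustrative assumptions, not the thesis configuration.

    # Co-training sketch: two views (textual, prosodic) label data for each other.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def co_train(X_text, X_pros, y, X_text_u, X_pros_u, rounds=10, k=5):
        # Labeled views (rows aligned across views) and the shared label vector.
        X_t, X_p, labels = X_text.copy(), X_pros.copy(), y.copy()
        unlabeled = np.arange(len(X_text_u))
        clf_t = clf_p = None
        for _ in range(rounds):
            clf_t = LogisticRegression(max_iter=1000).fit(X_t, labels)
            clf_p = LogisticRegression(max_iter=1000).fit(X_p, labels)
            if unlabeled.size == 0:
                break
            # Each view picks its k most confidently classified unlabeled sentences.
            conf_t = clf_t.predict_proba(X_text_u[unlabeled]).max(axis=1)
            conf_p = clf_p.predict_proba(X_pros_u[unlabeled]).max(axis=1)
            pick_t = unlabeled[np.argsort(-conf_t)[:k]]
            pick_p = np.setdiff1d(unlabeled[np.argsort(-conf_p)[:k]], pick_t)
            new_idx = np.concatenate([pick_t, pick_p])
            # The picking view provides the pseudo-labels for its selections.
            new_y = clf_t.predict(X_text_u[pick_t])
            if pick_p.size > 0:
                new_y = np.concatenate([new_y, clf_p.predict(X_pros_u[pick_p])])
            X_t = np.vstack([X_t, X_text_u[new_idx]])
            X_p = np.vstack([X_p, X_pros_u[new_idx]])
            labels = np.concatenate([labels, new_y])
            unlabeled = np.setdiff1d(unlabeled, new_idx)
        return clf_t, clf_p

Each round enlarges the shared training set with confidently labeled sentences from both views, which is how co-training exploits unlabeled meeting data.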
Keywords/Search Tags: Summarization, Meeting, Automatic, Data, Speech