Topic segmentation is the task of partitioning a text into topically coherent segments; each segment can then be assigned a category from a set of topics. Topic segmentation and segment labeling facilitate human understanding of text and support a variety of downstream tasks such as text summarization, question answering, information retrieval, and dialogue modeling. Existing models for supervised topic segmentation fall into three categories: sequence labeling models, topic-shift-based models, and generative models. Sequence labeling and topic-shift-based models focus on local semantics but lack the ability to capture long-range dependencies. Existing generative models cast topic segmentation as a generation task and identify segment boundaries by generating boundary indices, but their performance is limited. In addition, the ambiguity of segment boundaries and the annotation noise that are common in scenarios such as dialogues pose further challenges to existing models.

To address these issues, this paper first proposes a non-autoregressive sequence generation method for the dialogue domain, the Parallel Extraction Network with Neighbor Smoothing (PEN-NS). PEN-NS treats the problem as a segment extraction task and aims to predict all segments for any given topic. Its backbone, the Parallel Extraction Network, extracts all segments for all topics in parallel through a hierarchical utterance encoder, an attentive segment encoder, a parallel extractor, and a bipartite matching optimizer. In addition, we propose neighbor smoothing, which assigns an exponential distribution to the smoothed labels of the extracted segments based on their distances to the ground-truth segments. Experiments on real-world dialogue datasets and document-based datasets show that PEN-NS significantly outperforms state-of-the-art models. Further experiments validate the effectiveness of neighbor smoothing in alleviating boundary ambiguity and annotation noise.

For text topic segmentation and classification, we further propose a Sequence-to-Sequence Approach with Mix Pointers (Seq2Seq-MP), which casts topic segmentation and classification as a generative task and significantly outperforms existing generative models, sequence labeling models, and segment extraction models. Seq2Seq-MP leverages pre-trained models to encode text semantics, uses pointer networks to generate segment boundaries and topics in a unified manner, and solves topic segmentation and classification end-to-end. To make better use of the type information of each input vector (i.e., whether the input is a sentence vector or a topic vector), we propose pairwise type encoding and type-aware relative position encoding. These two encoding schemes explicitly introduce type information into the Transformer model and fuse it with position information. Experiments on publicly available datasets for text topic segmentation and classification show that Seq2Seq-MP outperforms the best existing models (including PEN-NS), and ablation studies validate the effectiveness of each component. Further analysis shows that Seq2Seq-MP has lower error rates in boundary prediction and topic classification and better models long-range dependencies.
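The neighbor smoothing idea described above can be sketched as follows: soft label mass decays exponentially with the distance between a candidate position and the ground-truth boundary. The window size, decay rate, and normalization below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def neighbor_smoothed_labels(num_positions, gt_boundary, alpha=1.0, window=2):
    """Build a soft label distribution over positions: the ground-truth
    boundary gets the highest weight, and neighbors within `window`
    receive exponentially decaying weight exp(-alpha * distance)."""
    weights = [0.0] * num_positions
    for i in range(num_positions):
        d = abs(i - gt_boundary)
        if d <= window:
            weights[i] = math.exp(-alpha * d)
    total = sum(weights)
    # normalize so the soft labels form a probability distribution
    return [w / total for w in weights]
```

Training against such soft targets (instead of a one-hot boundary label) lets near-miss predictions receive partial credit, which is the intuition behind tolerating ambiguous or noisily annotated boundaries.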
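As an illustration of how a pointer-style decoder can generate segment boundaries and topic labels in one unified output sequence, the sketch below linearizes (segment, topic) pairs into a single target sequence. The exact target format used by Seq2Seq-MP is not specified here, so this encoding is a hypothetical example.

```python
def to_pointer_targets(segments, topics):
    """Linearize (segment, topic) pairs into one target sequence:
    for each segment, emit a pointer to its end index, then its topic
    label. A decoder trained on such targets solves segmentation and
    classification end-to-end in a single pass."""
    targets = []
    for (start, end), topic in zip(segments, topics):
        targets.append(("boundary", end))  # pointer into the input sentences
        targets.append(("topic", topic))   # pointer into the topic vocabulary
    return targets
```

For instance, a dialogue whose first three utterances discuss the weather and whose next three discuss sports would yield an alternating boundary/topic target sequence, so a single decoder output stream carries both kinds of decisions.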