Topic segmentation is the task of partitioning a text into topically coherent segments; each segment can then be assigned a category from a set of topics. Topic segmentation and segment labeling facilitate human understanding of text and support a variety of downstream tasks such as text summarization, question answering, information retrieval, and dialogue modeling. Existing models for supervised topic segmentation fall into three categories: sequence labeling models, topic-shift-based models, and generative models. Sequence labeling and topic-shift-based models focus on local semantics but lack the ability to capture long-range dependencies. Existing generative models cast topic segmentation as a generation task and identify segment boundaries by generating boundary indices, but their performance is limited. In addition, the ambiguity of segment boundaries and the annotation noise that are common in scenarios such as dialogues pose further challenges to existing models.

To address these issues, this paper first proposes a non-autoregressive sequence generation method for the dialogue domain, the Parallel Extraction Network with Neighbor Smoothing (PEN-NS). PEN-NS treats the problem as a segment extraction task and aims to predict all segments for any given topic. Its backbone, the Parallel Extraction Network, extracts all segments for all topics in parallel through a hierarchical utterance encoder, an attentive segment encoder, a parallel extractor, and a bipartite matching optimizer. In addition, we propose neighbor smoothing, which assigns an exponential distribution to the smoothed labels of the extracted segments based on their distances to the ground-truth segments. Experiments on real-world dialogue datasets and document-based datasets show that PEN-NS significantly outperforms state-of-the-art models. Further experiments validate the effectiveness of neighbor smoothing in alleviating boundary ambiguity and annotation noise.

For text topic segmentation and classification, we further propose a Sequence-to-Sequence Approach with Mix Pointers (Seq2Seq-MP), which casts topic segmentation and classification as a generative task and significantly outperforms existing generative models, sequence labeling models, and segment extraction models. Seq2Seq-MP leverages pre-trained models to encode text semantics, uses pointer networks to generate segment boundaries and topics in a unified manner, and solves topic segmentation and classification end-to-end. To make better use of the type information of each input vector (i.e., whether the input is a sentence vector or a topic vector), we propose pairwise type encoding and type-aware relative position encoding. These two encoding schemes explicitly introduce type information into the Transformer model and fuse it with position information. Experiments on publicly available datasets for text topic segmentation and classification show that Seq2Seq-MP outperforms the best existing models (including PEN-NS), and ablation studies validate the effectiveness of each component. Further analysis shows that Seq2Seq-MP has lower error rates in boundary prediction and topic classification and better models long-range dependencies.
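The neighbor smoothing idea described above can be sketched as follows: soft label mass decays exponentially with the distance between a candidate position and the ground-truth boundary. The window size, decay rate, and normalization below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def neighbor_smoothed_labels(num_positions, gt_boundary, alpha=1.0, window=2):
    """Build a soft label distribution over positions: the ground-truth
    boundary gets the highest weight, and neighbors within `window`
    receive exponentially decaying weight exp(-alpha * distance)."""
    weights = [0.0] * num_positions
    for i in range(num_positions):
        d = abs(i - gt_boundary)
        if d <= window:
            weights[i] = math.exp(-alpha * d)
    total = sum(weights)
    # normalize so the soft labels form a probability distribution
    return [w / total for w in weights]
```

Training against such soft targets (instead of a one-hot boundary label) lets near-miss predictions receive partial credit, which is the intuition behind tolerating ambiguous or noisily annotated boundaries.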
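As an illustration of how a pointer-style decoder can generate segment boundaries and topic labels in one unified output sequence, the sketch below linearizes (segment, topic) pairs into a single target sequence. The exact target format used by Seq2Seq-MP is not specified here, so this encoding is a hypothetical example.

```python
def to_pointer_targets(segments, topics):
    """Linearize (segment, topic) pairs into one target sequence:
    for each segment, emit a pointer to its end index, then its topic
    label. A decoder trained on such targets solves segmentation and
    classification end-to-end in a single pass."""
    targets = []
    for (start, end), topic in zip(segments, topics):
        targets.append(("boundary", end))  # pointer into the input sentences
        targets.append(("topic", topic))   # pointer into the topic vocabulary
    return targets
```

For instance, a dialogue whose first three utterances discuss the weather and whose next three discuss sports would yield an alternating boundary/topic target sequence, so a single decoder output stream carries both kinds of decisions.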