
Research On Topic Segmentation Based On Deep Learning

Posted on: 2024-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhao
Full Text: PDF
GTID: 2568307151960709
Subject: Computer Science and Technology

Abstract/Summary:
With the increasing frequency of online activity, a large amount of textual data is being generated, making text topic segmentation based on semantic consistency an increasingly important problem in both academia and industry. Effective topic segmentation not only helps readers read more efficiently, but also facilitates downstream tasks such as text summarization, information retrieval, and fine-grained sentiment classification. However, existing topic segmentation models achieve a poor trade-off between performance and resource consumption when segmenting long texts. This thesis therefore proposes two methods that address these problems by improving the quality of the sentence representations learned within the model and by increasing the utilization of contextual information, respectively.

First, to address the weak text-sequence encoding ability of the LSTM encoder in existing hierarchical topic segmentation models, we propose a Sliding Window Attention Gated Recurrent Unit (SWAGRU) and use it as the encoder of a hierarchical segmentation model, SWAGRUSeg. The SWAGRU encoder introduces local attention into the recurrent unit through a sliding window and integrates it with the classic recurrent neural network, thereby improving the model's exploitation of lexical structures such as word pairs and strengthening its encoding of text sequences, ultimately raising the overall performance of the topic segmentation model. The model's effectiveness has been validated on multiple datasets, including Wiki-727K, Wiki-50, Choi, and Clinical; the results show that SWAGRUSeg outperforms the baseline models while keeping its parameter count relatively small.

Second, to address the weak feature aggregator used to generate intermediate sentence representations and the loss of contextual semantics caused by splitting long texts into blocks, we propose two improvements to SWAGRUSeg: a Global Attention based Pooler (GAP) and a Horizontal Semantic Cache Module (HSCM). The GAP adds global attention to the word encoder, compensating for the limited coverage of local attention over the sequence and increasing the participation of word embeddings at every time step of the recurrent network; this avoids the vanishing-gradient problem that arises when the final hidden state is used as the sentence representation and allows the model to converge faster. The HSCM adds a horizontal semantic-caching module over the sentence-embedding sequence, carrying as much information as possible from the previous text block into the next while keeping the model's resource usage stable through block-wise segmentation. The improvements have been validated on the Wiki-727K, Wiki-50, Choi, and Clinical datasets; the results show that they are effective and achieve the best performance on multiple datasets.
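As a concrete illustration of the encoder design, the following is a minimal PyTorch sketch of a sliding-window-attention recurrent unit in the spirit of SWAGRU. The dot-product scoring, the default window size, and the way the local context is concatenated with the input before the GRU cell are assumptions made for illustration, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SWAGRU(nn.Module):
    """Sketch of a sliding-window-attention GRU encoder.

    At each step, the current input attends over a window of the most
    recent inputs; the resulting local context is concatenated with the
    input and fed to a standard GRU cell. Layer sizes and the scaled
    dot-product scoring are illustrative assumptions.
    """

    def __init__(self, input_size: int, hidden_size: int, window: int = 5):
        super().__init__()
        self.window = window
        self.hidden_size = hidden_size
        self.query = nn.Linear(input_size, input_size)
        self.cell = nn.GRUCell(2 * input_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_size)
        batch, seq_len, dim = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        outputs = []
        for t in range(seq_len):
            lo = max(0, t - self.window + 1)
            win = x[:, lo:t + 1]                          # (batch, w, d)
            q = self.query(x[:, t]).unsqueeze(1)          # (batch, 1, d)
            scores = torch.bmm(q, win.transpose(1, 2))    # (batch, 1, w)
            attn = F.softmax(scores / dim ** 0.5, dim=-1)
            local_ctx = torch.bmm(attn, win).squeeze(1)   # (batch, d)
            # fuse the local attention context with the recurrent update
            h = self.cell(torch.cat([x[:, t], local_ctx], dim=-1), h)
            outputs.append(h)
        return torch.stack(outputs, dim=1)  # (batch, seq_len, hidden)
```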
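Similarly, a global-attention pooler in the spirit of GAP can be sketched as an attention-weighted sum over all encoder states, so that every time step, rather than only the final hidden state, contributes to the sentence representation and receives gradient. The single-layer scoring function below is an assumed, illustrative choice.

```python
class GlobalAttentionPooler(nn.Module):
    """Sketch of a global-attention pooler over word-encoder states."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, seq_len, hidden) from the word-level encoder
        weights = F.softmax(self.score(states), dim=1)  # (batch, seq_len, 1)
        # weighted sum over time steps yields the sentence embedding
        return (weights * states).sum(dim=1)            # (batch, hidden)
```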
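Finally, block-wise processing with a horizontal semantic cache can be sketched as follows. Carrying the recurrent state across blocks, and detaching it so that backpropagation stops at block boundaries, is an assumed mechanism standing in for the HSCM described above; the fixed block size is likewise illustrative.

```python
class HorizontalSemanticCache(nn.Module):
    """Sketch of block-wise sentence encoding with a semantic cache.

    A long document's sentence embeddings are processed in fixed-size
    blocks so that memory stays bounded, while the final recurrent
    state of each block is carried into the next so earlier context is
    not lost at block boundaries.
    """

    def __init__(self, sent_dim: int, hidden_size: int, block_size: int = 64):
        super().__init__()
        self.block_size = block_size
        self.encoder = nn.GRU(sent_dim, hidden_size, batch_first=True)

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, num_sents, sent_dim)
        h = None  # cached state carried horizontally across blocks
        outputs = []
        for start in range(0, sent_embs.size(1), self.block_size):
            block = sent_embs[:, start:start + self.block_size]
            out, h = self.encoder(block, h)
            h = h.detach()  # stop gradients at the boundary to bound memory
            outputs.append(out)
        return torch.cat(outputs, dim=1)  # (batch, num_sents, hidden)
```

Detaching the cached state keeps per-block training cost constant regardless of document length, which matches the stated goal of stabilizing resource usage on long texts.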
Keywords/Search Tags:natural language processing, recurrent neural networks, LSTM, attention mechanism, topic segmentation