When writing an article, authors commonly organize text segments around specific themes so that the segments are semantically coherent and logically connected. Accurately dividing a text into semantic segments and mapping them to structured topics not only matches readers' habits but also supports other text-related tasks, such as information retrieval, intent detection, and sentiment analysis. However, segmenting a text by semantics and identifying the main idea of each segment remains a challenging and uncertain task. On the one hand, documents in many fields, such as medicine and geography, lack segment and topic labels, so text segmentation techniques must learn topic-related word information from only a small amount of labeled data. On the other hand, documents from different data sources may share the same topics, so text segmentation techniques need a degree of transferability and robustness: after being trained on one dataset, they should still perform well on others.

Nevertheless, current text segmentation techniques leave significant room for improvement. First, some existing methods use the sentence topic distributions produced by topic models trained via Gibbs sampling as semantic representations of sentences; such training does not use gradient descent, which leads to low efficiency and limited applicability. In addition, sentence representations generated directly by pre-trained language models tend to be trapped in a small local region of the embedding space, so sentences end up highly similar to one another, making it difficult for language-model-based segmentation methods to discover semantic differences between sentences. Therefore, this thesis proposes a method that combines topic models and pre-trained language models to better address the tasks of text segmentation and segment topic prediction. In addition, considering the scarcity of segment topic labels, this thesis proposes an unsupervised pre-training
method based on sentence-level and segment-level data augmentation that uses contrastive learning to improve model performance.

The main research contents and contributions of this thesis include the following three aspects:

(1) To address the limitation that topic models trained with Gibbs sampling are difficult to optimize by gradient descent, and the tendency of sentence representations generated by pre-trained language models to be trapped in a small local region of the embedding space, this thesis proposes a semantic-guided and topic-guided text segmentation method. The method exploits the topic information extracted by the topic model together with the semantic information derived from the pre-trained language model to identify segment boundaries and predict segment topics more accurately.

(2) Because existing text segmentation datasets generally lack segment topic labels, this thesis proposes an unsupervised pre-training method based on sentence-level and segment-level data augmentation. Following the contrastive learning paradigm, the model learns more discriminative representations, thereby improving segment topic prediction. This pre-training method not only improves the text segmentation method proposed in this thesis but also significantly enhances the performance of previous text segmentation methods.

(3) Through extensive experiments on multiple publicly available datasets, this thesis demonstrates the effectiveness and feasibility of the proposed methods on text segmentation and segment topic classification tasks. Compared with existing approaches, the proposed methods achieve superior performance, indicating their practical value and application potential.
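To make the segmentation task concrete: once each sentence has a vector representation, a boundary can be placed wherever the similarity between adjacent sentences drops sharply. The sketch below is a generic, simplified illustration of this idea (in the style of TextTiling-like baselines), not the method proposed in the thesis; the threshold value and the toy two-dimensional vectors are assumptions for demonstration only.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def segment_boundaries(sentence_vecs, threshold=0.5):
    """Place a boundary after sentence i whenever the similarity between
    consecutive sentence vectors falls below `threshold`."""
    boundaries = []
    for i in range(len(sentence_vecs) - 1):
        if cosine(sentence_vecs[i], sentence_vecs[i + 1]) < threshold:
            boundaries.append(i)
    return boundaries

# Toy sentence vectors: an abrupt topic shift after sentence index 1.
vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(segment_boundaries(vecs))  # → [1]
```

This also illustrates why representations trapped in a local space are harmful: if all sentence vectors are nearly identical, every adjacent similarity stays above the threshold and no boundary is ever detected.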
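The contrastive pre-training in contribution (2) can be sketched with a minimal InfoNCE-style loss: an augmented view of a sentence or segment serves as the positive, other samples in the batch serve as negatives, and the loss pushes the anchor toward its positive and away from the negatives, yielding more discriminative representations. This is a hedged, self-contained sketch of the general paradigm, not the thesis's actual training objective; the temperature `tau` and the toy vectors are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: -log( exp(sim(a,p)/tau) / sum_k exp(sim(a,k)/tau) ),
    where k ranges over the positive and all negatives."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

# The loss is small when the anchor matches its augmented positive,
# and large when a negative is closer than the positive.
good = info_nce([1.0, 0.0], [0.9, 0.1], negatives=[[0.0, 1.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], negatives=[[0.9, 0.1]])
print(good < bad)  # → True
```

In practice the views would come from sentence-level and segment-level augmentations (e.g., perturbed or resampled text), and the loss would be minimized over encoder parameters by gradient descent.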