
Research On Representation Methods In Text Classification And Relevance Scenarios

Posted on: 2024-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: G Q Zhu
Full Text: PDF
GTID: 2568306932461904
Subject: Computer application technology
Abstract/Summary:
With the advent of the big data era, a large amount of text data inundates the Internet. Mining and analyzing text data is of great significance to people's lives and work. Text representation is the process of converting text data into a form that computers can process, so that it can be analyzed and put to use. Because of its generality, wide applicability, and importance, it has become an enduring research direction in natural language processing. The two primary applications of text representation are text classification and text relevance.

However, research methods for these two scenarios still have room for improvement. First, many real-world texts are short texts with sparse semantics, so representing them directly often yields insufficient semantic information and poor accuracy. Second, the Internet is full of texts composed of stacked keywords that do not conform to grammatical rules, and the representations of these texts often discriminate poorly because of noise. In addition, real-world applications place high demands on the timeliness of text representation, and how to balance timeliness with strong representation performance remains an open problem.

To address these issues, this thesis attempts to optimize existing text representation methods in the text classification and text relevance scenarios. For text classification, it proposes SDA, a classification method that focuses on semantic dependencies and associations in text. By fully mining the semantic dependency relationships within the text itself and integrating concepts from external prior knowledge, SDA enriches the representation of the text and improves classification performance. For text relevance, it proposes TRGD, a distillation learning model that is aware of text relationships. Contrastive learning between accurate texts and the noisy texts related to them alleviates the problem of representation collapse. Distillation learning is then used to improve model performance without affecting inference efficiency.

Specifically, the research content and main contributions of this thesis are as follows:

(1) A text classification method based on semantic dependency and association, namely SDA, is proposed. This method constructs a heterogeneous graph from the text, incorporating both the dependency relations among words and their associative relations with external knowledge, to enable deep mining of semantic information. Graph neural networks and attention mechanisms then carry out semantic interactions on the graph to obtain the final representation used for text classification. The method outperforms multiple baseline models on four public datasets, which validates its effectiveness.

(2) A text relevance method based on text relational graphs and distillation learning, namely TRGD, is proposed. First, the method applies contrastive learning on the text relational graphs, which effectively mitigates the influence of text noise. Next, it uses distillation learning to fit the representation-based model to the output of the interaction-based model, so that the representation-based model achieves performance close to that of the interaction-based model without increasing inference overhead. This approach combines the efficient inference of the representation-based model with the strong performance of the interaction-based model, improving both the accuracy and the efficiency of relevance tasks. The method improves significantly over multiple baseline models on a large-scale real-world dataset, which confirms its effectiveness.
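
The two training signals described for TRGD can be illustrated with a minimal sketch. The abstract does not specify the loss functions, so everything concrete here is an assumption made for illustration: an InfoNCE-style contrastive loss over cosine similarities, a mean-squared-error distillation loss, the function names `cosine`, `info_nce`, and `distill_mse`, the temperature, and the toy vectors and teacher scores. It is not the thesis's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed form, not from the thesis).

    Pulls the anchor embedding toward a related text (positive) and
    pushes it away from unrelated texts (negatives).
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive candidate.
    return log_sum - logits[0]

def distill_mse(student_scores, teacher_scores):
    """Mean squared error between student and teacher relevance scores
    (an assumed distillation objective)."""
    pairs = list(zip(student_scores, teacher_scores))
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)

# Toy embeddings from a hypothetical representation-based (student) encoder.
query = [1.0, 0.0]
related = [0.9, 0.1]
unrelated = [[0.0, 1.0], [-1.0, 0.2]]

contrastive = info_nce(query, related, unrelated)

# The student scores each pair by embedding similarity; the teacher scores
# stand in for an interaction-based model's outputs and are made up.
student = [cosine(query, related), cosine(query, unrelated[0])]
teacher = [0.95, 0.05]
distill = distill_mse(student, teacher)

total = contrastive + 0.5 * distill  # the 0.5 weight is illustrative only
```

Because the student scores a pair from embeddings that can be computed ahead of time, inference cost stays that of the representation-based model. The teacher is needed only during training.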
Keywords/Search Tags:Representation Learning, Text Classification, Text Relevance, Graph Neural Networks, Contrastive Learning, Distillation Learning