
Research On Text Classification Based On Improved Graph Convolution Neural Network

Posted on: 2024-01-12  Degree: Master  Type: Thesis
Country: China  Candidate: H C Zheng  Full Text: PDF
GTID: 2568306938959129  Subject: Computer application technology
Abstract/Summary:
Text classification is the task of automatically assigning predefined category labels to text data, and it is a key task in natural language processing. In the era of big data, text data on the Web is growing explosively, so text classification techniques are needed to organize and manage this massive amount of data effectively. These techniques are widely used in many areas of everyday life, such as topic classification, sentiment analysis, and opinion analysis.

Text classification methods based on graph convolution handle unstructured data well and achieve good classification results. Such models construct a graph structure over the text and learn feature representations of the nodes through convolution operations that take text structure information into account. However, the traditional graph convolution model has the following shortcomings. First, the model usually builds a heterogeneous graph over the whole corpus, ignoring the unique structural features of individual texts and failing to mine their deep semantic information. Second, classification was previously performed transductively: the model must learn over the whole corpus before classifying with the learned knowledge, so it struggles to classify texts it has not seen during training. In addition, when the text heterogeneous graph is constructed, the edge weights between documents and words are usually given by TF-IDF, while the edge weights between words are given by PMI. An adjacency matrix built this way cannot express the fine-grained structural features of the text. Moreover, TF-IDF considers only the number of word occurrences in a document and cannot effectively extract the semantic relationship between documents and words, while PMI cannot accurately express word co-occurrence relationships. To address these problems, this paper
proposes two improvements:

1. A Text-Level BERT Graph Convolutional Network (Text-Level-BertGCN) based on BERT optimization is proposed. The model combines the advantages of large-scale language-model pretraining and text-level graph neural networks. First, BERT is used to initialize the word vertices of a text-level graph, so that the graph contains both the spatial structure and the semantic information of the text. Then, based on the word features of this text-level graph, category predictions are made by the combined BERT+GCN model and by BERT alone, and the model is trained with a mixed loss over the two prediction results. Experimental results show that the model achieves good classification performance, especially on long-text datasets, where it better extracts long-range semantic and structural information.

2. A new graph construction method is proposed. Only the corpus data from the training documents are used to build the text heterogeneous graph; no test documents are involved, so the model can classify unseen texts inductively. Document-word edges are weighted by TF-IWF and word-word edges by PPMI, yielding a PT adjacency matrix that effectively captures the fine-grained structural features of the text. TF-IWF expresses the dependency between words and documents more accurately and allows information to be aggregated over long distances, while PPMI expresses word co-occurrence relationships more accurately and makes effective use of co-occurrence information. In the training phase, TF-IWF weights are also used to form a weighted average of word vectors as the document vector, which mines the semantic features implied by the words. Based on this construction method, the PT-InducTGCN (PPMI-TF-IWF Inductive Graph Convolutional Network) text classification model is proposed; the improved model raises the accuracy of inductive text classification.
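To illustrate the mixed-loss idea behind combining a BERT prediction with a BERT+GCN prediction, the sketch below follows one common BertGCN-style formulation: the two predicted class distributions are linearly interpolated with a coefficient `lam`, and the loss is the negative log-likelihood of the interpolated distribution. This is a minimal sketch under assumed details (the interpolation coefficient `lam`, and interpolating distributions rather than summing two separate losses, are assumptions); the thesis's exact mixing scheme may differ.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_prediction(gcn_logits, bert_logits, lam=0.7):
    """Interpolate the two class distributions: lam * GCN + (1 - lam) * BERT."""
    p_gcn = softmax(gcn_logits)
    p_bert = softmax(bert_logits)
    return [lam * g + (1 - lam) * b for g, b in zip(p_gcn, p_bert)]

def mixed_loss(gcn_logits, bert_logits, label, lam=0.7):
    """Negative log-likelihood of the interpolated distribution for the true label."""
    return -math.log(mixed_prediction(gcn_logits, bert_logits, lam)[label])
```

Because the interpolated vector is a convex combination of two probability distributions, it remains a valid distribution, so the standard cross-entropy machinery applies unchanged.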
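As a concrete illustration of the edge-weighting schemes discussed above, the sketch below contrasts the traditional TF-IDF/PMI weights with TF-IWF/PPMI weights, and shows a TF-IWF-weighted average of word vectors as a document representation. It is a minimal sketch under assumed formulations: TF-IWF is implemented here as term frequency times the log of total corpus tokens over the corpus-wide count of the word, and PPMI as PMI clipped at zero; the thesis may normalize these quantities differently, and the toy `doc_vector` helper is illustrative, not the thesis's implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Traditional document-word edge weights: TF times inverse document frequency."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return {(i, w): (c / len(d)) * math.log(n / df[w])
            for i, d in enumerate(docs) for w, c in Counter(d).items()}

def tfiwf(docs):
    """TF-IWF document-word edge weights (assumed formulation): IWF uses the
    corpus-wide token count of w instead of its document frequency."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {(i, w): (c / len(d)) * math.log(total / counts[w])
            for i, d in enumerate(docs) for w, c in Counter(d).items()}

def pmi_pairs(docs, window=3):
    """Word-word PMI computed from sliding-window co-occurrence statistics."""
    windows = [d[j:j + window] for d in docs
               for j in range(max(1, len(d) - window + 1))]
    total = len(windows)
    single, pair = Counter(), Counter()
    for win in windows:
        uniq = sorted(set(win))
        single.update(uniq)
        pair.update((a, b) for k, a in enumerate(uniq) for b in uniq[k + 1:])
    return {ab: math.log((c / total) /
                         ((single[ab[0]] / total) * (single[ab[1]] / total)))
            for ab, c in pair.items()}

def ppmi_pairs(docs, window=3):
    """PPMI word-word edge weights: keep only positive PMI values."""
    return {ab: v for ab, v in pmi_pairs(docs, window).items() if v > 0}

def doc_vector(doc, doc_id, word_vecs, weights):
    """TF-IWF-weighted average of word vectors as the document representation."""
    dim = len(next(iter(word_vecs.values())))
    vec, norm = [0.0] * dim, 0.0
    for w in set(doc):
        wt = weights.get((doc_id, w), 0.0)
        norm += wt
        for k in range(dim):
            vec[k] += wt * word_vecs[w][k]
    return [v / norm for v in vec] if norm else vec
```

Note the inductive property: `tfiwf` and `ppmi_pairs` only ever see the training documents passed to them, so the adjacency matrix never depends on test data.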
Keywords/Search Tags:Natural language processing, Text classification, Deep learning, BERT, Graph convolution neural network