
Research On The Representations Of Word And Text And Text Classification

Posted on: 2021-05-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Guo
Full Text: PDF
GTID: 1488306314499664
Subject: Computer application technology
Abstract/Summary:
Representation learning is a fundamental problem in natural language processing (NLP) and is particularly crucial for text classification. Starting from word and text representations, this dissertation explores their properties and identifies and solves three problems in text classification: text information overflow, coarse-grained text information, and text semantic ambiguity.

(1) All models suffer from information overflow when processing long texts. Neural text representation models usually define a fixed input length at the input layer and truncate over-long texts, which mitigates the problem as far as possible. For text representation models based on linear operations, however, the impact is more severe and no effective solution yet exists. To address this for such models, we introduce the concepts of word and text containers and, from the container perspective, propose a container-based text representation model. The method uses clustering algorithms to divide the text container into several sub-containers. Depending on the clustering algorithm, we propose two models: a text vector extension model based on k-means clustering and a text vector extension model based on random clustering. Both effectively extend the text vector and enhance its capacity for storing information. Experiments on five long-text classification datasets show that the proposed models effectively alleviate information overflow and improve the quality of text representations.

(2) The problem of coarse-grained text information arises mainly from the single form in which words and texts are expressed. Almost all recent models express words and texts as vectors, and the features captured by this single dimension are clearly limited. To address this, we study fine-grained feature representation and propose the idea of word and text matrix representations. We introduce matrices into text representation models based on linear operations and construct an effective matrix representation architecture for words and texts, namely doc2matrix. Doc2matrix builds matrix representations by defining fine-grained sub-windows, with two methods for generating word matrices and three methods for generating text matrices. To effectively capture the two-dimensional features of the text matrix, we also propose a convolution-based text matrix classifier. Experiments on four long-text classification datasets show that doc2matrix generates higher-quality word and text representations and outperforms several existing linear text representation models. The experiments also show that the word and text matrices produced by doc2matrix not only contain more fine-grained semantic features of the text but also introduce rich two-dimensional features.

(3) Semantic ambiguity is a main cause of degraded text classification performance. It arises because models cannot generate word representations that match the current context for important words in a text, such as polysemous words. Dynamic representation models such as ELMo and BERT can generate dynamic word and text representations from the current context, but they rely only on complex network layers to process the semantics of the text while still using the traditional embedding-layer scheme of mapping each word to a single fixed vector. The word and text representations obtained by dynamic representation models therefore still mix information from different semantic contexts and cannot fully resolve semantic ambiguity. To address this, we study multi-vector word representations and propose a method to optimize word and text representations: the Polysemous-Aware Vector Representation Model (PAVRM). PAVRM identifies polysemous words in a corpus with a recognition method based on context clustering. It then constructs polysemous word representations in two ways: a method based on context clustering (PAVRM-context) and a method based on center vectors (PAVRM-center). Experiments on three standard text classification tasks and one custom text classification task show that PAVRM effectively identifies polysemous words and generates polysemous word representations that are closer to their current context semantics. PAVRM can also be incorporated into existing models to improve their classification performance.
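The abstract does not give implementation details for the clustering-based text vector extension in (1). As one plausible reading, a text's word vectors (its "text container") could be partitioned into sub-containers by k-means, with each sub-container summarized by its mean and the means concatenated into an extended text vector; the function name and the zero-fill for empty clusters below are my assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def extend_text_vector(word_vectors: np.ndarray, k: int = 3) -> np.ndarray:
    """Split a text container (its word vectors) into k sub-containers
    via k-means, then concatenate the sub-container means into one
    extended text vector of dimension k * d."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(word_vectors)
    d = word_vectors.shape[1]
    parts = []
    for c in range(k):
        members = word_vectors[km.labels_ == c]
        # an empty cluster contributes a zero sub-vector
        parts.append(members.mean(axis=0) if len(members) else np.zeros(d))
    return np.concatenate(parts)

# toy example: 20 random "word vectors" of dimension 8
rng = np.random.default_rng(0)
vec = extend_text_vector(rng.normal(size=(20, 8)), k=3)
print(vec.shape)  # (24,)
```

The extended vector is k times larger than a plain averaged text vector, which is consistent with the claim that it enhances the capacity for storing information.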
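For the sub-window idea behind doc2matrix in (2), one minimal sketch (my interpretation; the window size, the averaging inside each window, and the function name are assumptions) is to average word vectors inside fixed-size sub-windows so that each sub-window becomes one row of a two-dimensional text matrix, which a convolutional classifier could then consume:

```python
import numpy as np

def doc2matrix_sketch(word_vectors: np.ndarray, window: int = 4) -> np.ndarray:
    """Build a text matrix by averaging word vectors inside fixed-size
    sub-windows; each sub-window yields one row of the matrix."""
    n, _ = word_vectors.shape
    rows = [word_vectors[i:i + window].mean(axis=0)
            for i in range(0, n, window)]
    return np.stack(rows)

# toy example: 10 "word vectors" of dimension 6, windows of 4 words
rng = np.random.default_rng(1)
m = doc2matrix_sketch(rng.normal(size=(10, 6)), window=4)
print(m.shape)  # ceil(10/4) = 3 rows -> (3, 6)
```

Unlike a single pooled vector, the matrix preserves coarse word order across sub-windows, which is the kind of two-dimensional feature a convolutional text matrix classifier can exploit.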
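The polysemous word recognition in (3) is described only as context clustering. A minimal sketch of that idea (the cluster-share threshold, function name, and the use of centroids as sense vectors are my assumptions, not PAVRM's specification) clusters the context vectors of one word's occurrences and flags the word as polysemous when each cluster captures a non-trivial share of occurrences:

```python
import numpy as np
from sklearn.cluster import KMeans

def sense_vectors(context_vecs: np.ndarray, k: int = 2,
                  min_share: float = 0.2):
    """Cluster the context vectors of one word's occurrences; if every
    cluster holds at least `min_share` of the occurrences, treat the
    word as polysemous and return one centroid per candidate sense."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(context_vecs)
    shares = np.bincount(km.labels_, minlength=k) / len(context_vecs)
    polysemous = bool((shares >= min_share).all())
    return polysemous, km.cluster_centers_

# two well-separated "context" clouds -> flagged as polysemous
rng = np.random.default_rng(2)
ctx = np.vstack([rng.normal(0, 0.1, (30, 5)), rng.normal(5, 0.1, (30, 5))])
poly, centers = sense_vectors(ctx)
print(poly, centers.shape)  # True (2, 5)
```

In this reading, PAVRM-context would assign each occurrence the centroid of its context cluster, while PAVRM-center would derive sense vectors from a center vector; the abstract does not specify either construction in detail.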
Keywords/Search Tags:Word container, Text container, Matrix representation, Polysemous words