
Topic Awareness Model And Training Efficiency Optimization For Text Multi-label Classification

Posted on: 2023-09-29  Degree: Master  Type: Thesis
Country: China  Candidate: J B Zhao  Full Text: PDF
GTID: 2558306620971059  Subject: Computer application technology
Abstract/Summary:
In the information age, human society produces and accumulates massive amounts of text data in work and daily life, and classifying these data accurately so that they can be managed scientifically has great practical significance. Text multi-label classification is the task of determining, for a given label set, the subset of labels that applies to a document based on its content, and it has many real-world applications. The mainstream research paradigm models the relationship between the words in a document and the labels and focuses on analyzing the correlations among labels. However, existing methods commonly make insufficient use of document clues and neglect the semantic association between words and labels, so they cannot fully describe the semantic similarities and differences between documents, which weakens document representation learning.

To address these problems, this thesis designs a topic-awareness model for multi-label classification. First, the existing GloVe model is modified: the words of each document and its corresponding labels are jointly counted into the co-occurrence matrix, and embedding vectors of words and labels in the same feature space are obtained by training on the classification corpus. Then, on top of existing deep learning models, several topic vectors initialized by a topic model are introduced; the document features produced by the deep network are mapped onto the different topic vectors through an attention mechanism, yielding multiple fine-grained document features that each focus on a different topic. Finally, the fine-grained document features are fused with the original features and interact with the label embedding vectors to produce the classification result. By embedding words and labels into the same feature space, the method captures their semantic association; by generating fine-grained document features through topic awareness, it makes full use of document clues and models the implicit relationship between documents and labels more comprehensively than existing methods. Experiments show that the method substantially improves classification precision.

At the same time, facing the challenge that the volume of text data keeps growing, and targeting the large memory overhead and low computational efficiency of existing methods on large-scale data, this thesis summarizes three training-efficiency optimizations: loading data into memory in batches, constructing an efficient data pipeline between the CPU and GPU, and multi-GPU joint training. Experimental results confirm that these methods significantly improve training efficiency and reduce memory consumption, which is of practical importance.
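The joint word-label co-occurrence counting step could look roughly like the following sketch. The window size, the label token names, and the rule that a label co-occurs once with every word of its document are illustrative assumptions, not the thesis's exact counting scheme.

```python
from collections import Counter

def build_cooccurrence(docs, window=5):
    """Count a joint word-label co-occurrence matrix (sparse, as a Counter).

    `docs` is a list of (tokens, labels) pairs. The window size and the rule
    that a label co-occurs once with every word of its document are assumptions.
    """
    counts = Counter()
    for tokens, labels in docs:
        # GloVe-style windowed word-word co-occurrence, weighted by distance.
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                weight = 1.0 / (j - i)
                counts[(w, tokens[j])] += weight
                counts[(tokens[j], w)] += weight
        # Each label of the document co-occurs with every word in it.
        for lab in labels:
            for w in tokens:
                counts[(lab, w)] += 1.0
                counts[(w, lab)] += 1.0
    return counts

# Toy example: two documents with their label sets.
docs = [
    (["deep", "learning", "for", "text"], ["__label_ML__"]),
    (["topic", "model", "of", "text"], ["__label_NLP__", "__label_ML__"]),
]
matrix = build_cooccurrence(docs)
print(matrix[("__label_ML__", "text")])  # 2.0: the label co-occurs with "text" in both documents
```

Training GloVe on such a matrix places word vectors and label vectors in one shared space, which is what allows the later word-label interaction.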
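A minimal sketch of the topic-aware attention and label-interaction step, written in PyTorch. The encoder is abstracted away as precomputed token features, and the dimensions, pooling choices, and layer names are assumptions made for illustration rather than the thesis's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicAwareClassifier(nn.Module):
    """Sketch: topic vectors attend over token features; the resulting
    fine-grained features are fused with the coarse document feature and
    scored against label embeddings living in the same space."""

    def __init__(self, hidden_dim, num_topics, num_labels, topic_init=None):
        super().__init__()
        # Topic vectors, optionally initialized from a topic model (e.g. LDA).
        init = topic_init if topic_init is not None else torch.randn(num_topics, hidden_dim)
        self.topics = nn.Parameter(init)
        # Label embeddings assumed to share the feature space of the encoder.
        self.label_emb = nn.Parameter(torch.randn(num_labels, hidden_dim))
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, token_feats):
        # token_feats: (batch, seq_len, hidden_dim), the output of any deep encoder.
        doc_feat = token_feats.mean(dim=1)                       # coarse document feature
        # Each topic attends over the tokens -> one topic-focused document feature per topic.
        scores = torch.einsum('bsh,th->bts', token_feats, self.topics)
        attn = F.softmax(scores, dim=-1)                         # (batch, topics, seq_len)
        topic_feats = torch.einsum('bts,bsh->bth', attn, token_feats)
        fine = topic_feats.mean(dim=1)                           # pool the fine-grained features
        fused = torch.tanh(self.fuse(torch.cat([doc_feat, fine], dim=-1)))
        # Interaction with the label embeddings yields one score per label.
        return fused @ self.label_emb.t()                        # (batch, num_labels)

# Example usage with random encoder outputs (2 documents, 30 tokens each).
model = TopicAwareClassifier(hidden_dim=256, num_topics=8, num_labels=54)
logits = model(torch.randn(2, 30, 256))   # (2, 54) scores, one per label
```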
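The three training-efficiency optimizations can be illustrated with a short PyTorch training loop. The dataset, model, and hyper-parameters below are placeholders; only the use of worker prefetching, pinned memory with non-blocking copies, and DataParallel for multi-GPU joint training reflects the techniques summarized above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class ShardedTextDataset(Dataset):
    """Serves pre-vectorized examples one record at a time instead of holding
    the whole corpus in memory; random tensors stand in for real shard reads."""

    def __init__(self, num_examples=10_000, feat_dim=300, num_labels=54):
        self.num_examples, self.feat_dim, self.num_labels = num_examples, feat_dim, num_labels

    def __len__(self):
        return self.num_examples

    def __getitem__(self, idx):
        # In practice this would read one record from a memory-mapped shard on disk.
        x = torch.randn(self.feat_dim)
        y = (torch.rand(self.num_labels) > 0.9).float()
        return x, y

def main():
    loader = DataLoader(
        ShardedTextDataset(),
        batch_size=64,
        num_workers=4,      # CPU workers prefetch and prepare batches ahead of the GPU
        pin_memory=True,    # page-locked host memory speeds up host-to-device copies
    )

    model = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 54))
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)   # single-process multi-GPU joint training
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)

    for x, y in loader:
        # non_blocking copies overlap the transfer with computation when memory is pinned
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
        optim.zero_grad()
        loss.backward()
        optim.step()

if __name__ == "__main__":
    main()
```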
Keywords/Search Tags:Text multi-label classification, Embedded representation, Topic awareness, Fine-grained features