
Algorithm Research For Chinese Text Multi-label Classification

Posted on: 2021-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Liu
Full Text: PDF
GTID: 2428330623467816
Subject: Computer Science and Technology
Abstract/Summary:
As one of the fundamental natural language processing tasks, text classification has a wide range of applications in recommendation systems, search engines, and public opinion analysis. The multi-label text classification studied in this thesis is common in real life. Current multi-label classification algorithms generally suffer from two problems. First, most models consider the correlation between sample labels little or not at all during classification, which clearly conflicts with intuition. Second, the underlying text representation methods are limited: traditional word vector representations either cannot resolve polysemy or cannot capture contextual information from both directions. How to exploit prior knowledge and label-related knowledge to obtain better text representations has therefore become an urgent problem.

For the representation problem, this thesis extracts text representations with the BERT language model. This approach fundamentally solves the polysemy problem and replaces LSTM with the more advanced transformer feature extractor, which offers long-distance feature extraction, efficient parallelism, and fast convergence; because it encodes text bidirectionally, the resulting representation carries more positional and sequential information and is more robust.

For the label correlation problem, the thesis gives two solutions according to the characteristics of the dataset: 1) For small short-text datasets with relatively obvious features, few label combinations, and high co-occurrence among some labels, label combinations are used to describe label relevance; the multi-label task is thereby transformed into a multi-class problem, the text is classified with TextCNN, and focal loss is used to alleviate class imbalance in the dataset. 2) For more general datasets, the thesis adopts a seq2seq architecture in which the correlation between labels is modeled by the generation task at the decoder: the encoder is a transformer encoder, while the decoder still uses an LSTM. A hybrid self-attention mechanism is added on the encoder side to inject the prior knowledge that words in a text are closely related to their context, strengthening the encoding ability; a masked softmax is applied on the decoder side to avoid repeated predictions, and a label-vector sharing mechanism is used to mitigate exposure bias and avoid falling into local optima.

In the experimental part, after EDA data augmentation was applied to the original data to alleviate class imbalance, the above methods were verified on the two datasets and achieved F1 scores of 0.8210 and 0.8465 respectively, demonstrating the rationality of the two model structures.
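For illustration, the following is a minimal sketch of the focal loss used in solution 1) to alleviate class imbalance, showing how well-classified label combinations are down-weighted; the gamma and alpha values, tensor shapes, and class counts are illustrative assumptions, not taken from the thesis.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-class focal loss: down-weights easy, well-classified examples."""
    log_probs = F.log_softmax(logits, dim=-1)                     # [batch, num_classes]
    probs = log_probs.exp()
    target_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    target_probs = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    # (1 - p_t)^gamma focuses training on hard examples; alpha rebalances classes.
    loss = -alpha * (1.0 - target_probs) ** gamma * target_log_probs
    return loss.mean()

# Hypothetical usage: logits from a TextCNN over label-combination classes.
logits = torch.randn(8, 32)              # 8 samples, 32 label combinations
targets = torch.randint(0, 32, (8,))
print(focal_loss(logits, targets))
```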
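Likewise, a minimal sketch of a masked softmax at the decoder in solution 2), which blocks labels that have already been generated so they cannot be predicted again; the tensor shapes and mask layout are illustrative assumptions rather than the thesis's exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_softmax(scores, emitted_mask):
    """scores: [batch, num_labels]; emitted_mask: True where a label was already generated."""
    scores = scores.masked_fill(emitted_mask, float('-inf'))   # block repeated labels
    return F.softmax(scores, dim=-1)

# Hypothetical decoding step over 5 candidate labels for 2 samples.
scores = torch.randn(2, 5)
emitted = torch.tensor([[True, False, False, False, False],
                        [False, False, True, True, False]])
print(masked_softmax(scores, emitted))
```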
Keywords/Search Tags: label relevance, seq2seq, transformer, text representation