
Research Of Text Classification Algorithm Based On Document Representation

Posted on: 2020-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: B Shu
Full Text: PDF
GTID: 2428330575496900
Subject: Computer technology
Abstract/Summary:
With the rise of deep learning and the generation of large amounts of data, including text, speech, and images, learning useful features from such data has become a central problem. In the field of natural language processing, learning document representations is essential for the precise understanding of natural language and can be applied to a variety of tasks, including text categorization, text similarity matching, and named entity recognition. This paper focuses on the recurrent neural network and the BERT model, optimizes the input or output of these two network architectures to improve generalization performance, and uses the text classification task to verify the scalability of the classification algorithms. The results and main work of this paper are as follows:

1. Training a long short-term memory (LSTM) network directly on the text classification task is not effective. To better learn document representations for text classification, an LSTM network with pooling and dropout is designed to represent documents: pooling retains the main features while reducing the number of parameters and producing a fixed-length output, and dropout prevents over-fitting and improves generalization when learning document representations in a supervised fashion. Compared with the bag-of-words model, the convolutional neural network, and the plain LSTM network, the optimized LSTM improves accuracy on four data sets by at least 0.2% over the direct use of an LSTM network.

2. For the BERT model, which currently performs well in natural language processing, the category probability distribution produced by the single softmax output layer is too limited. Inspired by the mixture of softmaxes, the softmax layer of BERT is replaced with an improved mixture of softmaxes, using the idea of ensembling to weight the output of each softmax component. On the four data sets, accuracy improves by more than 1% over the BERT-Base model.
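The abstract does not give the exact pooling architecture, but the key step it describes, collapsing a variable-length sequence of LSTM hidden states into a fixed-length document vector, can be sketched as an element-wise max over time. This is a minimal illustration, not the thesis's implementation; the function name and plain-list representation are assumptions.

```python
def max_pool_over_time(hidden_states):
    """Element-wise max over a variable-length sequence of hidden vectors.

    hidden_states: list of T vectors (each a list of D floats), e.g. the
    per-timestep outputs of an LSTM. Returns one D-dimensional vector,
    so documents of different lengths T all map to the same output size.
    """
    dim = len(hidden_states[0])
    return [max(h[d] for h in hidden_states) for d in range(dim)]


# Example: a 3-step sequence of 2-dimensional hidden states
doc_vector = max_pool_over_time([[1.0, 5.0], [3.0, 2.0], [2.0, 4.0]])
# doc_vector == [3.0, 5.0]
```

Because the output dimension depends only on the hidden size, not the sequence length, the pooled vector can feed a fixed-size classification layer, which is what allows the fixed-length output the abstract mentions.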
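The mixture-of-softmaxes idea in item 2 replaces the single softmax with a weighted average of K softmax distributions, where the mixture weights themselves come from a softmax. A minimal sketch of the probability computation follows; the function names are assumptions, and the thesis's improved variant and its integration with BERT are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_softmax(component_logits, mixture_scores):
    """Weighted average of K softmax distributions.

    component_logits: K logit vectors, one per softmax component
    mixture_scores:   K unnormalized scores; normalized with a softmax
                      to give the mixture weights pi_k
    Returns p(y) = sum_k pi_k * softmax(component_logits[k]).
    """
    pis = softmax(mixture_scores)
    num_classes = len(component_logits[0])
    probs = [0.0] * num_classes
    for pi, logits in zip(pis, component_logits):
        for i, p in enumerate(softmax(logits)):
            probs[i] += pi * p
    return probs


# Two components over three classes, with unequal mixture weights
probs = mixture_softmax([[2.0, 1.0, 0.1], [0.5, 1.5, 0.2]], [0.3, 0.7])
```

A convex combination of probability distributions is itself a probability distribution, so the output always sums to 1; the gain over a single softmax is that the mixture can express class distributions no single low-rank softmax layer can.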
Keywords/Search Tags: dropout, LSTM, BERT, pooling, mixture of softmaxes