
Research On BERT-based Chinese Long Text Classification Algorithm

Posted on: 2022-11-13  Degree: Master  Type: Thesis
Country: China  Candidate: C Bao  Full Text: PDF
GTID: 2518306758466104  Subject: Information and Communication Engineering
Abstract/Summary:
Faced with massive amounts of text data, a high-accuracy classification model for document management can both optimize the back-end data warehouse at a fine granularity and let users quickly obtain the feedback they need. Because deep learning, unlike traditional machine learning, can automatically learn high-level text features, this dissertation builds on deep learning. To improve the accuracy of document classification, it addresses two problems: current document classification methods ignore the salient semantic features of the text, and existing hierarchical models have a shallow structure. The main research work of the dissertation is as follows:

(1) To address the limit that BERT (Bidirectional Encoder Representations from Transformers) places on input text length, a document partition algorithm is proposed. By dividing the document into small blocks that serve as the input of the BERT model, the complexity of the text representation stage is reduced from O(n^2) to O(ns), where n is the length of the input text and s is the length of each block.

(2) Current hierarchical algorithms for document classification use only the global target vector as the sentence vector representation and ignore the salient semantic features of the text. To address this, a segmentation attention document fusion model based on fusion features is proposed. The feature vector obtained by convolution and max pooling is combined with the sentence vector generated by the BERT model to represent local text features comprehensively. On this basis, the global information of the document is obtained through a bi-directional long short-term memory network, and a basic attention mechanism is introduced to focus on the key points for document classification.

(3) Existing hierarchical models are shallow and ignore the structural characteristics of documents. Inspired by Chinese document structure and the hierarchical attention mechanism, a hierarchical model based on the self-attention mechanism is proposed. The model divides the document into three layers, "word-sentence-paragraph", and applies a bi-directional gated recurrent unit and a self-attention mechanism at each layer. Through this hierarchical focus, the model attends to the key positions of the document, gives greater weight to important words, sentences and paragraphs, and improves the classification model's ability to extract the semantic information of the document.

Two Chinese document datasets are collected in the dissertation: a maritime dataset and the Fudan University Chinese dataset. Experimental analysis shows that the two proposed document classification models achieve better classification performance.
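
The complexity claim in item (1) can be illustrated with a minimal sketch. It is not the dissertation's partition algorithm, which the abstract does not specify. It assumes fixed-length, non-overlapping blocks and counts only the token pairs scored by self-attention. The names `partition_document` and `attention_pairs`, and the block length of 510 (BERT's 512 positions minus the [CLS] and [SEP] tokens), are assumptions made for illustration.

```python
from typing import List


def partition_document(tokens: List[str], block_len: int) -> List[List[str]]:
    """Split a token sequence into consecutive blocks of at most block_len tokens.

    Hypothetical helper: fixed-size, non-overlapping blocks are an assumption,
    not the partition rule used in the dissertation.
    """
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    return [tokens[i:i + block_len] for i in range(0, len(tokens), block_len)]


def attention_pairs(blocks: List[List[str]]) -> int:
    """Count token pairs scored by self-attention when each block is encoded on its own."""
    return sum(len(block) ** 2 for block in blocks)


# A stand-in document of n = 2040 tokens, split into blocks of s = 510 tokens.
n, s = 2040, 510
tokens = [f"tok{i}" for i in range(n)]
blocks = partition_document(tokens, s)

full_cost = attention_pairs([tokens])   # one pass over the whole text: n * n pairs
blocked_cost = attention_pairs(blocks)  # (n / s) blocks of s * s pairs each: n * s pairs

print(len(blocks), full_cost, blocked_cost)
```

With n / s blocks of s tokens each, the pair count is (n / s) * s^2 = ns, which is where the O(ns) figure comes from. When n is not a multiple of s, the last block is shorter and the count falls slightly below ns.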
Keywords/Search Tags: Document Classification, BERT Model, Attention Mechanism, Convolutional Neural Network, Recurrent Neural Network