Research On BERT-based Chinese Long Text Classification Algorithm

Posted on: 2022-11-13

Degree: Master

Type: Thesis

Country: China

Candidate: C Bao

GTID: 2518306758466104

Subject: Information and Communication Engineering

Abstract/Summary:

Faced with massive volumes of text, a high-accuracy classification model for document management can both optimize the back-end data warehouse at a fine granularity and let users quickly retrieve the information they need. Because deep learning, unlike traditional machine learning, learns high-level text features automatically, this dissertation builds on deep learning. It addresses two problems in order to improve document classification accuracy: current document classification methods ignore salient semantic features of the text, and existing hierarchical models have a shallow structure. The main research work in the dissertation is as follows:

(1) To work around the input-length limit of BERT (Bidirectional Encoder Representations from Transformers), a document partition algorithm is proposed. The document is divided into small blocks that serve as the input to the BERT model, which reduces the complexity of the text representation stage from O(n^2) to O(ns), where n is the length of the input text and s is the length of each block.

(2) Current hierarchical algorithms for document classification use only the global target vector as the sentence vector representation and ignore salient semantic features of the text. To address this, a segmentation-attention document fusion model based on fused features is proposed. The feature vector obtained by convolution and max pooling is combined with the sentence vector generated by the BERT model to represent local text features more comprehensively. On this basis, a bidirectional long short-term memory (BiLSTM) network captures the global information of the document, and a basic attention mechanism focuses on the parts most relevant to classification.

(3) Existing hierarchical models are shallow and ignore the structural characteristics of documents. Inspired by the structure of Chinese documents and by the hierarchical attention mechanism, a hierarchical model based on the self-attention mechanism is proposed. The model divides a document into three levels (word, sentence, paragraph) and applies a bidirectional gated recurrent unit (BiGRU) and self-attention at each level. This hierarchical focus gives greater weight to important words, sentences, and paragraphs, and improves the model's ability to extract semantic information from documents.

Two Chinese document datasets are collected in the dissertation: a maritime dataset and the Fudan University Chinese dataset. Experimental analysis shows that the two proposed document classification models achieve better classification performance.
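
The partition step in (1) and its complexity claim can be illustrated with a minimal sketch. It assumes fixed-length, non-overlapping blocks and character-level tokens; the abstract does not say how blocks are delimited, so `partition_document`, `attention_cost`, and the block lengths below are hypothetical names and values chosen for illustration, not the dissertation's implementation.

```python
from typing import Dict, List


def partition_document(tokens: List[str], block_len: int) -> List[List[str]]:
    """Split a token sequence into consecutive blocks of at most block_len tokens.

    Assumption: fixed-length, non-overlapping blocks. The source only states
    that the document is divided into small blocks.
    """
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    return [tokens[i:i + block_len] for i in range(0, len(tokens), block_len)]


def attention_cost(n: int, s: int) -> Dict[str, int]:
    """Count pairwise token interactions for full vs. blockwise self-attention."""
    full = n * n  # every token attends to every other token: O(n^2)
    full_blocks, remainder = divmod(n, s)
    # Each block of length s costs s*s; n/s such blocks give roughly n*s in total.
    blockwise = full_blocks * s * s + remainder * remainder
    return {"full": full, "blockwise": blockwise}


# Character-level tokens for a short Chinese sentence (illustrative input only).
doc = list("这是一个用于演示文档分块的简短中文示例文本")
blocks = partition_document(doc, block_len=8)

# With n = 2048 and s = 512 the blockwise count equals n * s exactly.
costs = attention_cost(n=2048, s=512)
```

Each block would then be encoded separately by BERT. In practice the block length has to leave room for the special tokens that BERT adds within its input-length limit.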

Keywords/Search Tags:

Document Classification, BERT Model, Attention Mechanism, Convolutional Neural Network, Recurrent Neural Network

Related items

1	Research On Sentiment Analysis Algorithm Of Commodities Review Based On Convolutional Recurrent Neural Network
2	Research On Text Classification Model Based On BGRU And Self-Attention Mechanism
3	Research On Text Classification Model Based On Deep Neural Network
4	Research On Classification Of News Text Based On Deep Learning
5	Text Classification Research Based On Deep Neural Network And Attention Mechanism
6	Research On Long Text Classification Algorithm Via Multi-model Fusion With Attention Mechanism
7	Research On Short Text Sentiment Classification Model Based On Deep Learning
8	Text Representation And Classification Based On Deep Learning With Improved Attention Mechanism
9	Application Research On Automatic Classification Of Massive Academic Resources
10	Text Sentiment Classification Based On Deep Learning And Attention Mechanism