
Research On Long Document Classification Method Based On Deep Learning

Posted on: 2021-02-15    Degree: Master    Type: Thesis
Country: China    Candidate: T J Jiang    Full Text: PDF
GTID: 2428330647952754    Subject: Electronics and Communications Engineering
Abstract/Summary:
With the continuous development of science and technology, Internet and database technologies have advanced rapidly. Every area of social life now produces large amounts of data every second, including a great deal of text. Academic research, industrial research, technology companies, and other fields all have a strong demand for processing such text data. The texts handled in these fields are often long and information-rich, which has made long-document management a hot topic for researchers. Text classification is a fundamental task in text management, with important applications in information retrieval, information filtering, sentiment classification, and other areas. It refers to the process of automatically assigning a text to a category under a given classification system according to the specific information the text contains. Deep-learning classification methods build an efficient neural network model that accurately extracts the semantic features of a text and then predicts its category.

To preserve the complete semantics, traditional methods usually encode the whole text as the network input and then train a convolutional neural network or a recurrent neural network. This approach has achieved good results on sentences and short texts. For long documents, however, encoding the whole text makes the input dimension very large and the overall computational cost of the model high; moreover, because of the length, the network cannot fully relate contextual information across the text, so feature extraction is incomplete and classification accuracy is low. To address this problem, this thesis proposes a long-document classification model based on global feature extraction. The model randomly divides a long text into several parts, uses a convolutional neural network to extract the local features of each part, and then uses a long short-term memory network to associate the features of the parts. This reduces the input dimension of the network while retaining as much detail of the whole text as possible for classification.

In practice, encoding a whole document also places high demands on hardware. To save resources further and improve efficiency, a human reader would select only part of a long text as input, ignore the remaining content, and base the classification on that local text. This requires that the extracted local text contain the important information of the document, and that the network construct an accurate feature representation of the text even when the input information is incomplete. To address this problem, this thesis proposes a long-document classification model based on local feature extraction. An improved hard attention algorithm is proposed to accurately locate the passages of the long text that carry important information and use them as the input of the feature extraction model; a hierarchical feature extraction model is built to extract text features progressively from words to sentences to paragraphs; and a soft attention mechanism is applied at the word, sentence, and paragraph levels respectively, so that the important content of each level is distinguished when the document feature representation is constructed.

Two data sets are collected, and experiments show that both models can effectively and accurately distinguish long documents on similar topics.
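The global-feature model described above can be pictured as a shared CNN over document chunks followed by an LSTM that links the chunk features. The following is a minimal sketch of that idea in PyTorch; the class name, layer sizes, and hyperparameters are illustrative assumptions, not the exact architecture of the thesis.

import torch
import torch.nn as nn

class ChunkCNNLSTMClassifier(nn.Module):
    # Illustrative sketch: split document into chunks, CNN per chunk, LSTM over chunks.
    def __init__(self, vocab_size, embed_dim=128, conv_channels=100,
                 lstm_hidden=128, num_classes=2, kernel_size=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Shared CNN applied to every chunk independently to extract local features.
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size, padding=1)
        # LSTM associates the per-chunk features so context flows across chunks.
        self.lstm = nn.LSTM(conv_channels, lstm_hidden, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, num_chunks, chunk_len) integer word indices
        b, n_chunks, chunk_len = token_ids.shape
        x = self.embedding(token_ids.view(b * n_chunks, chunk_len))  # (B*N, L, E)
        x = self.conv(x.transpose(1, 2))                             # (B*N, C, L)
        x = torch.relu(x).max(dim=2).values                          # max-pool over time
        x = x.view(b, n_chunks, -1)                                  # (B, N, C) chunk features
        _, (h_n, _) = self.lstm(x)                                   # aggregate chunk sequence
        return self.classifier(h_n[-1])                              # (B, num_classes)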
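The local-feature model combines hard attention for selecting important paragraphs with hierarchical soft attention over words, sentences, and paragraphs. The sketch below covers only the hierarchical soft-attention part, assuming the important paragraphs have already been selected upstream; all names, layer choices (bidirectional GRUs), and dimensions are illustrative assumptions rather than the thesis' exact design.

import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    # Attention-weighted average over a sequence of feature vectors.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, x):                                   # x: (batch, seq, dim)
        weights = torch.softmax(self.score(x), dim=1)       # (batch, seq, 1)
        return (weights * x).sum(dim=1)                      # (batch, dim)

class HierarchicalAttentionClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.word_gru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.sent_gru = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.para_gru = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.word_attn = SoftAttention(2 * hidden)
        self.sent_attn = SoftAttention(2 * hidden)
        self.para_attn = SoftAttention(2 * hidden)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, num_paras, sents_per_para, words_per_sent)
        b, p, s, w = token_ids.shape
        x = self.embedding(token_ids.view(b * p * s, w))     # word embeddings
        x, _ = self.word_gru(x)
        sent_vecs = self.word_attn(x).view(b * p, s, -1)     # sentence vectors
        x, _ = self.sent_gru(sent_vecs)
        para_vecs = self.sent_attn(x).view(b, p, -1)         # paragraph vectors
        x, _ = self.para_gru(para_vecs)
        doc_vec = self.para_attn(x)                          # document vector
        return self.classifier(doc_vec)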
Keywords/Search Tags: deep learning, long document classification, attention mechanism