
Research On Long News Texts Representation And Classification Method Based On Network Model Fusion

Posted on: 2021-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: W S Qin
GTID: 2428330602987159
Subject: Engineering
Abstract/Summary:
With advances in computer technology and the arrival of the big data era, text classification is playing an increasingly important role as a branch of natural language processing. It has a wide range of applications in personalized recommendation, data mining, information retrieval, and related areas. After years of research and development, a complete body of text classification techniques has been established, and researchers have proposed many theoretical innovations and practical applications based on both traditional machine learning and deep learning. However, while the explosive growth of text information drives the development of text classification, it also raises many challenges, especially for long texts. Current representation methods for long texts suffer either from serious information loss or from excessively high dimensionality. The requirements on the accuracy and stability of text classification also keep rising, so many aspects of the field still need further study.

This dissertation focuses on the classification of long news texts, with the goal of classifying news texts by subject. After preprocessing steps such as noise-word removal, word segmentation, and stop-word removal, it uses the chi-square test (CHI) and Word2Vec to obtain a vector representation of long news texts, which addresses the information loss and high dimensionality caused by traditional long-text representation methods. It then trains a neural network model based on the fusion of a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU) with an attention mechanism. The THUCNews dataset is the main experimental dataset, and the effectiveness of the model is verified through experiments. The innovations of this dissertation are as follows:

(1) In terms of text representation, this dissertation proposes W2V-CHI, a long-text representation method that combines CHI and Word2Vec. Most current text classification methods target short texts. For long news texts, the usual approach is to truncate the text first and then apply the same methods as for short texts. This inevitably loses text information, and many of the features that remain after truncation have little or no influence on the classification, which degrades classification performance. To solve this problem, W2V-CHI combines the strength of the chi-square test in feature extraction with the word vector representation ability of Word2Vec. The main idea is to first apply a chi-square test to each word feature: word features that meet the test criterion are retained and those that do not are discarded. The retained word features are then represented by Word2Vec word vectors. This avoids the crude truncation of long texts used by traditional methods, and the resulting text representation carries more semantic information at a lower dimension. Experiments demonstrate the effectiveness of this text representation method.

(2) This dissertation proposes MLCNN & BiGRU-ATT, a hybrid network model based on the horizontal fusion of CNN and GRU. With the development of deep learning, a large number of CNN and RNN models have been applied to text classification with good results, and each has its own advantages in processing text data. First, GRU, a variant of RNN, is now widely used in natural language processing tasks. It captures the contextual information of a text easily, which gives it a natural advantage in processing sequential data, and its structure is relatively simple, so its demands on computing resources are modest. Second, CNN has a clear advantage in extracting local features of a text, which enriches the information extracted from the text. Finally, a text is composed of words, and different words influence the classification to very different degrees. A common way to reflect this difference is to compute the importance of each word to the classification through an attention mechanism and to assign weights accordingly, highlighting the contribution of keywords and reducing or ignoring the role of irrelevant words. Building on these respective advantages of GRU and CNN, this dissertation takes the text representation produced by the W2V-CHI method as input and proposes a classification model that includes a multi-layer CNN and a two-layer GRU with an attention mechanism. The model has strong learning capability and can extract deep semantics that account for both local and global textual information. Experimental results show that, compared with traditional text classification models, the proposed model achieves higher classification accuracy on the THUCNews dataset and the SogouCS dataset.
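The CHI filtering step of W2V-CHI described in (1) can be illustrated with a minimal sketch. The abstract does not state the test criterion, so the function names, the toy corpus, the threshold of 1.0, and the choice to score each word by its maximum chi-square statistic over all classes are assumptions made for illustration only. The subsequent Word2Vec lookup of the retained words is omitted.

```python
from collections import defaultdict


def chi_square_scores(docs, labels):
    """Return {term: max chi-square statistic over classes} (hypothetical helper)."""
    n = len(docs)
    class_count = defaultdict(int)  # number of documents per class
    term_count = defaultdict(int)   # number of documents containing each term
    joint = defaultdict(int)        # (term, class) -> documents of that class containing term
    for tokens, label in zip(docs, labels):
        class_count[label] += 1
        for term in set(tokens):
            term_count[term] += 1
            joint[(term, label)] += 1

    scores = {}
    for term, df in term_count.items():
        best = 0.0
        for label, n_c in class_count.items():
            a = joint.get((term, label), 0)  # in class, contains term
            b = df - a                       # other classes, contains term
            c = n_c - a                      # in class, lacks term
            d = n - df - c                   # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_terms(scores, threshold):
    """Keep terms whose score reaches the (assumed) threshold."""
    return {term for term, score in scores.items() if score >= threshold}


def filter_doc(tokens, kept):
    """Drop discarded terms while preserving word order."""
    return [t for t in tokens if t in kept]


# Toy corpus, already segmented; real input would be preprocessed news texts.
docs = [
    ["match", "team", "the"],
    ["team", "goal", "the"],
    ["stock", "market", "the"],
    ["market", "bank", "the"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = chi_square_scores(docs, labels)
kept = select_terms(scores, threshold=1.0)
filtered = filter_doc(["the", "team", "match"], kept)
```

In this toy corpus the word "the" occurs in every document, so it carries no class information and is discarded, while class-specific words such as "team" are kept. The filtered token sequence is what would then be mapped to Word2Vec vectors.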
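The attention weighting described in (2), where each word receives a weight reflecting its importance to the classification, can be sketched as follows. The abstract does not give the scoring function, so the dot-product scoring against a context vector is an assumption, as are the function name and the toy values. In a trained model the context vector would be learned and the hidden states would come from the GRU.

```python
import math


def attention_pool(hidden_states, context_vector):
    """Weight each hidden state by softmax(dot(state, context)) and sum them.

    Hypothetical helper: dot-product scoring is an illustrative assumption.
    """
    scores = [
        sum(h * u for h, u in zip(state, context_vector))
        for state in hidden_states
    ]
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[j] for w, state in zip(weights, hidden_states))
        for j in range(dim)
    ]
    return weights, pooled


# Three toy hidden states (one per word) and a toy context vector.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
context_vector = [1.0, 0.0]

weights, pooled = attention_pool(hidden_states, context_vector)
```

The weights sum to one, and the word whose hidden state aligns most strongly with the context vector dominates the pooled representation. This is how keywords are highlighted and irrelevant words are down-weighted.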
Keywords/Search Tags: Text Categorization, Deep Learning, Chi-square test, CNN, GRU