
Design And Implementation Of Long Text Classification Algorithm Based On Deep Neural Network

Posted on: 2021-01-20  Degree: Master  Type: Thesis
Country: China  Candidate: W H Xu  Full Text: PDF
GTID: 2428330614963928  Subject: Signal and Information Processing
Abstract/Summary:
Text classification is one of the basic techniques in natural language processing. Many tasks depend on it, such as news classification, question classification in question answering systems, and movie review classification. Classifying text manually is inefficient, so automatic text classification by computer is an important research direction. This thesis studied two families of text classification methods: those based on traditional machine learning and those based on deep learning.

First, this thesis designed two classifiers based on traditional machine learning methods: a Naive Bayes classifier and an SVM classifier. Computers cannot process raw text directly, so each text must be represented as a vector before it can be classified. The two classifiers used word-frequency features based on the bag-of-words model and TF-IDF features. Experiments on the Sogou dataset and the Sohu dataset showed that the SVM classifier combined with TF-IDF features worked best, achieving 89% accuracy on both datasets.

Next, this thesis used BiLSTM to design two text classification models: one uses a standard BiLSTM network, and the other combines BiLSTM with an attention mechanism. Text was represented either with one-hot vectors or with skip-gram-based word embeddings. Experiments on the two datasets showed that the model combining BiLSTM and the attention mechanism with word embeddings worked best, with an accuracy of 96% on the Sogou dataset and 90% on the Sohu dataset.

Finally, this thesis used CNN for text classification with three text representation methods: one-hot representation, word embedding, and BERT pre-trained word vectors. Each of the three representations was combined with CNN and evaluated on the two datasets. On the Sogou dataset, CNN with word embedding and BERT performed best, reaching 97% accuracy; on the Sohu dataset, CNN with one-hot representation performed best, also reaching 97% accuracy. The explanation given is that the Sohu dataset is relatively small and distributed word vectors do not perform well with so little data, whereas the Sogou dataset is relatively large, so distributed vectors outperform one-hot representation there. The practical recommendation is therefore to avoid distributed word vectors on small datasets.
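
The bag-of-words and TF-IDF representations mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which TF-IDF variant the thesis used, so the formula below (term frequency multiplied by log(N / document frequency)) is an assumption, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Map a tokenized document to raw word counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tf_idf(documents):
    """Return (vocabulary, vectors) using tf * log(N / df).

    This is one common textbook variant, assumed here for illustration;
    it is not necessarily the weighting used in the thesis.
    """
    vocabulary = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

    vectors = []
    for doc in documents:
        counts = bag_of_words(doc, vocabulary)
        length = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(count / length) * idf[term] for count, term in zip(counts, vocabulary)]
        )
    return vocabulary, vectors


# Hypothetical, already-tokenized toy documents.
docs = [
    ["stock", "market", "rises"],
    ["team", "wins", "match"],
    ["stock", "prices", "fall"],
]
vocabulary, vectors = tf_idf(docs)
```

In the first toy document, "stock" receives a lower weight than "market" because "stock" also occurs in another document. Vectors of this kind are what a classifier such as an SVM would take as input.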
Keywords/Search Tags: Text classification, text representation, word embedding, deep learning