
Feature Selection And Feature Representation Text Classification Based On Convolutional Neural Networks

Posted on: 2020-09-12
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Gao
Full Text: PDF
GTID: 2428330596473804
Subject: Electronic and communication engineering
Abstract/Summary:
Mobile communication has developed from 1G to 4G, and 5G is now arriving rapidly. At every moment the Internet generates huge amounts of data, including text, images, audio, and video, which are stored on cloud servers, personal computers, and mobile communication devices. How to obtain valuable information efficiently and quickly is the issue that concerns Internet users most. Technology that can intelligently and automatically classify massive amounts of information and filter out worthless or unhealthy content has therefore become a hot research field. As one of the research hotspots in natural language processing, text classification is of great significance for improving the network environment and processing massive amounts of text. This thesis aims to improve the accuracy of text classification and to shorten the training time of the text classification model. The main research contents are as follows:

1. The performance of the current mainstream word segmentation tools is tested first, with segmentation accuracy and segmentation time as the evaluation criteria. On this basis, the Jieba word segmentation tool is used to segment the text. Many different stop-word lists appear in the literature, and the open-source lists each have their own merits, so this thesis compiles a consolidated stop-word list. This lays a solid foundation for text preprocessing.

2. Four traditional feature selection algorithms are studied: Document Frequency (DF), Chi-Square Test (CHI), Mutual Information (MI), and Information Gain (IG). To address the "low-frequency word defect" of the CHI feature selection algorithm, an improved method is proposed from the perspectives of word frequency and class dispersion, and experiments are carried out with a naive Bayes classifier. The improved algorithm, CHI-M, achieves an average accuracy of 87.49% and a recall rate of 86.73%, which are 4.88% and 4.94% higher, respectively, than before the improvement. This verifies the effectiveness of the improved algorithm.

3. Text feature representation is an important part of the text classification task. This thesis first examines the LDA topic vector model, which is based on a probabilistic model, and the word2vec word vector model, which is based on a neural network, and trains the important parameters of both models. Then, taking both semantic expression and word meaning into account, it combines the two and designs a new text feature representation model, LDA-word.

4. To verify the validity of the LDA-word text feature representation model and to break through the accuracy limit of traditional machine learning classifiers, text classification is implemented with a Convolutional Neural Network (CNN). First, to speed up convergence of the model, the ReLU activation function is used in the convolutional layer. Second, the Dropout strategy is used to reduce over-fitting of the CNN model. Finally, the Sigmoid function is introduced in the output layer to improve the stability of the model output.

5. The training process of the three text feature representation models is monitored with the TensorBoard visualization tool of the deep learning framework TensorFlow. The LDA topic vector model, the word2vec word vector model, and the LDA-word model are each used for text feature representation, and their outputs are fed into the CNN to perform text classification. The experimental results show that the LDA-word model proposed in this thesis significantly improves accuracy and recall rate. After the training corpus is input into the CNN, its training efficiency is improved by 0.71 times and 1.56 times compared with the LDA topic vector model and the word2vec word vector model, respectively.
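
As an illustration of the CHI feature selection statistic named in the abstract, the following is a minimal Python sketch of the standard chi-square score between a term and a class. It is not the thesis's CHI-M variant, whose formula the abstract does not give. The function names `chi_square` and `chi_scores` and the toy corpus are hypothetical, chosen only for this example.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square score for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_scores(docs, labels, target):
    """Score every term against the class `target`.

    docs is a list of token lists, labels the matching class labels.
    Only document-level presence is counted, not within-document frequency.
    """
    n_target = sum(1 for y in labels if y == target)
    n_other = len(labels) - n_target
    in_target, in_other = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            if y == target:
                in_target[term] += 1
            else:
                in_other[term] += 1
    scores = {}
    for term in set(in_target) | set(in_other):
        a, b = in_target[term], in_other[term]
        scores[term] = chi_square(a, b, n_target - a, n_other - b)
    return scores


# Hypothetical toy corpus (already segmented into tokens).
docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["stock", "team"]]
labels = ["sports", "sports", "finance", "finance"]
scores = chi_scores(docs, labels, "sports")
```

In this toy corpus, "team" appears equally often in both classes and scores 0, while "goal" appears only in the "sports" documents and scores highest. Because the statistic uses only document counts, a term's frequency inside a document does not affect its score. The abstract says CHI-M improves on CHI from the perspectives of word frequency and class dispersion.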
Keywords/Search Tags:text categorization, feature selection, LDA, word2vec, convolutional neural network