Research On Semi-supervised News Text Classification Method Based On Deep Learning

Posted on: 2022-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: R Xiang
GTID: 2518306530498124
Subject: Computer software and theory
Abstract/Summary:
In today's era of big data, news text is growing explosively and the resulting mass of news text is disorganized, which makes it more difficult and time-consuming for users to obtain the information they need. Text classification technology can organize and manage massive news text data scientifically and effectively. With the breakthroughs of deep learning in natural language processing and other fields, various supervised deep learning algorithms have been applied to news text classification. Supervised deep learning algorithms can effectively extract features from the data and achieve strong classification results, but they require a large amount of labeled data, which in practice is not easy to obtain. Semi-supervised classification requires only a small amount of labeled data and can use unlabeled data to improve the classification effect. Therefore, this thesis combines deep learning and semi-supervised learning to classify news texts. The main research work of this thesis is as follows:
1. Construct a semi-supervised news text classification framework based on deep learning. The framework includes a news text acquisition module, a text preprocessing module, a semi-supervised news text classification module and a testing module. These four modules respectively carry out news text data acquisition; preprocessing (word segmentation, stop word removal and word vectorization); semi-supervised news text classification; and testing of the classification effect.
2. Give a method of crawling news text data. This thesis uses the Scrapy framework to crawl news text data, including headlines and content, from news web pages, and uses XPath to locate and parse the data. The crawled data is cleaned with regular expressions and then saved. During crawling, the User-Agent is set to imitate a web browser to avoid anti-crawler mechanisms.
3. Give a text word segmentation method. The method builds on Bi-GRU-CRF and adds a parallel Bi-GRU-attention network to integrate dictionary information into word segmentation training. One Bi-GRU neural network extracts feature information, and the other, a Bi-GRU-attention neural network, obtains dictionary information by constructing feature vectors. The two parts of information are combined and input into the conditional random field for word segmentation.
4. Give a semi-supervised news text classification method based on deep learning. This method combines pseudo-label learning with deep learning and can use unlabeled data together with a small amount of labeled data for news text classification. First, a small amount of labeled data is used to train the Bi-GRU classification network; the trained network then predicts the unlabeled data, and the predictions serve as pseudo-labels. Once the pseudo-labels are obtained, the Bi-GRU network extracts features from both the labeled data and the pseudo-labeled data, a fully connected layer reduces the dimensionality of the feature vector, and the softmax function performs the classification. During training, the temporal ensembling algorithm integrates the pseudo-labels from multiple iterations to improve their accuracy, and the Dropout mechanism is used to avoid overfitting.
5. Conduct experiments and comparative analysis. This thesis uses the development tool PyCharm and the deep learning frameworks TensorFlow and PyTorch to conduct comparative experiments. On the public data sets PKU and MSR, the word segmentation algorithm given in this thesis is compared with mainstream word segmentation algorithms to test its effectiveness. On the public data set cnews and the data set znews, which was crawled with the crawler algorithm given in this thesis, the ratio of labeled to unlabeled data is adjusted, and the semi-supervised news text classification algorithm given in this thesis is compared with mainstream semi-supervised and supervised news text classification algorithms to test its effectiveness.
The results of the comparative experiments show that the word segmentation algorithm given in this thesis performs better than current mainstream neural network word segmentation algorithms, with a certain degree of improvement in various evaluation indicators. Compared with mainstream semi-supervised and supervised classification algorithms, the classification results of the semi-supervised news text classification algorithm demonstrate its effectiveness. The results of multiple sets of comparative experiments show that the semi-supervised news text classification algorithm based on deep learning given in this thesis can effectively reduce the need for labeled data while still classifying news texts effectively, allowing users to quickly and accurately obtain the news information they need. The classification results can also be used in areas such as personalized news recommendation and news information retrieval.
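The abstract does not spell out how temporal ensembling combines pseudo-labels across iterations. As an illustration only, the sketch below follows the commonly described form of the technique: an exponential moving average of each unlabeled sample's softmax output, corrected for its start-up bias, with the pseudo-label taken as the most probable class. The function names, the decay value `alpha=0.6` and the sample probabilities are assumptions made for this example and are not taken from the thesis.

```python
from typing import List

Probs = List[float]


def update_ensemble(ensemble: List[Probs], predictions: List[Probs],
                    alpha: float, step: int) -> List[Probs]:
    """Fold one epoch of softmax outputs into the running average.

    ensemble is modified in place: Z <- alpha * Z + (1 - alpha) * z.
    Returns the bias-corrected targets Z / (1 - alpha ** step); step starts at 1.
    """
    targets = []
    for i, probs in enumerate(predictions):
        ensemble[i] = [alpha * z + (1.0 - alpha) * p
                       for z, p in zip(ensemble[i], probs)]
        correction = 1.0 - alpha ** step
        targets.append([z / correction for z in ensemble[i]])
    return targets


def to_pseudo_labels(targets: List[Probs]) -> List[int]:
    """Pick the most probable class for each unlabeled sample."""
    return [max(range(len(p)), key=p.__getitem__) for p in targets]


# Hypothetical softmax outputs for two unlabeled articles over three epochs.
epoch_predictions = [
    [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]],
    [[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]],
    [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],
]

# The ensemble starts at zero for every sample and class.
ensemble = [[0.0, 0.0, 0.0] for _ in range(2)]
for step, predictions in enumerate(epoch_predictions, start=1):
    targets = update_ensemble(ensemble, predictions, alpha=0.6, step=step)
    labels = to_pseudo_labels(targets)
```

Because the targets average over several epochs, a single noisy prediction (the second epoch for the first article here) has less influence on the final pseudo-label than it would if only the latest output were used.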
Keywords/Search Tags:Deep learning, Semi-supervised learning, Word segmentation, News text classification