
Research On Chinese Text Classification Based On Keyword Strategy And CNN

Posted on: 2020-12-08
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Chen
GTID: 2428330611494485
Subject: Electrical engineering
Abstract/Summary:
With the continued development and maturation of Internet technology, digital information is being generated and distributed at an explosive rate, and a large proportion of it is text. How to classify massive text data automatically and quickly has become a subject worthy of further study. Traditional text classification builds a classifier manually from hand-written rules and can no longer cope with the current volume of data. In recent years, deep learning has developed rapidly; its strong representation capability makes it better at extracting the main information in text, and it has achieved excellent results in text classification. This thesis therefore uses deep learning to study data preprocessing, text feature representation, and classifier models for Chinese text classification, and proposes a new framework. The specific research content and results are as follows.

First, the thesis introduces the theory related to text classification, covering the definition and process of text classification, text preprocessing, feature vector representation models for Chinese text, and feature word extraction algorithms.

Second, Chinese text suffers from noise and sparse features, so useless feature words need to be removed before the text is fed to the classification model. To address this, the thesis proposes a Chinese text classification framework based on a keyword strategy and convolutional neural networks. In this framework, a word vector model is first built with Word2Vec, and Segmentation Term Frequency-Document Frequency (STF-DF) is then used to select keywords with strong class-discrimination ability as the feature word set of each sample. This effectively removes useless feature words and yields a more accurate text feature representation. On this basis, a Convolutional Neural Network (CNN) suited to Chinese text classification is constructed as the classifier. Experimental results show that the framework achieves accuracies of 94.51% and 95.04% on the THUCNews and Fudan University Chinese text data sets, respectively, and a recall of 99.70% on a real-world harmful-information data set, which validates the effectiveness of the proposed framework.

Finally, to address the low recognition rate of minority classes in imbalanced text data sets, both the feature word extraction algorithm and the loss function are optimized. For extraction, the chi-square statistic (CHI) and TF-IDF algorithms are improved and combined into a new CHI-TF-IDF keyword extraction algorithm, which raises the weight of minority-class keywords so that they are selected with higher priority, avoiding the loss of feature information and improving classification accuracy. The proposed algorithm performs well on a variety of imbalanced data sets; in a binary classification experiment on an imbalanced data set built from THUCNews, its F1 value is 2.56% higher than that of the CHI algorithm. For the loss function, Focal Loss, originally applied in the image domain, is applied to text classification and its hyperparameters are tuned, which improves classification performance on imbalanced data sets to some extent. Experimental results show that the improved method raises the recognition rate of minority classes in both binary and multi-class classification; on the constructed imbalanced THUCNews binary classification data set, the macro F1 value (the average of the per-class F1 values) increases by 2.55%.
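
The Focal Loss mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard binary form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function names and the default values gamma=2.0 and alpha=0.25 are assumptions for illustration (they are common defaults from image detection); the abstract does not state which hyperparameters the thesis selected, and this is not the thesis's implementation.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one sample; p is the predicted P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p        # probability assigned to the true class
    p_t = min(max(p_t, eps), 1.0)         # clamp to avoid log(0)
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one sample.

    gamma and alpha are illustrative defaults, not values from the thesis.
    gamma down-weights well-classified samples; alpha weights the positive class.
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy positive (p = 0.9) and a hard positive (p = 0.1).
    for p in (0.9, 0.1):
        print(p, cross_entropy(p, 1), focal_loss(p, 1))
```

With gamma = 0 the modulating factor (1 - p_t)^gamma equals 1 and the loss reduces to alpha-weighted cross-entropy. Larger gamma shrinks the contribution of confidently classified samples, so training focuses more on hard samples, which in an imbalanced data set often come from the minority class.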
Keywords/Search Tags: Text classification, feature word extraction, Convolutional Neural Network, unbalanced data set, Focal Loss