Research On Chinese Text Classification Based On Keyword Strategy And CNN

Posted on:2020-12-08

Degree:Master

Type:Thesis

Country:China

Candidate:D Y Chen

Full Text:PDF

GTID:2428330611494485

Subject:Electrical engineering

Abstract/Summary:

With the continuing development and maturing of Internet technology, the rate at which digital information is generated and distributed has exploded, and a large proportion of that information is text. How to classify massive volumes of text automatically and quickly has therefore become a subject worthy of further study. Traditional manual text classification, in which a classifier is built by hand from rules, can no longer cope with the current amount of data. In recent years, deep learning has developed rapidly; thanks to its powerful representation capabilities, it can better extract the main information in text and achieves excellent results in text classification. This thesis therefore uses deep learning to study data preprocessing, text feature representation, and classifier models for Chinese text classification, and proposes a new framework. The specific research content and results are as follows.

First, the thesis introduces the theory underlying text classification, covering the definition and process of text classification, text preprocessing, feature vector representation models for Chinese text, and feature word extraction algorithms.

Second, to address the noise and feature sparsity found in Chinese text, useless feature words need to be removed before the text is fed to the classification model. To this end, the thesis proposes a Chinese text classification framework based on a keyword strategy and convolutional neural networks. In this framework, a word vector model is first built with Word2Vec, and Segmentation Term Frequency-Document Frequency (STF-DF) is then used to select keywords with strong class discrimination ability as the feature word set of each sample. This effectively removes useless feature words and yields a more accurate text feature representation. On this basis, a Convolutional Neural Network (CNN) suited to Chinese text classification is constructed to perform the classification. Experimental results show that the framework achieves accuracies of 94.51% and 95.04% on the THUCNews and Fudan University Chinese text datasets, respectively. It also achieves a recall of 99.70% on a real-world harmful-information dataset, which validates the effectiveness of the proposed framework.

Finally, to address the low recognition rate of minority classes in imbalanced text datasets, the thesis optimizes both the feature word extraction algorithm and the loss function. For extraction, the chi-square statistic (CHI) and the TF-IDF algorithm are improved and combined into a new CHI-TF-IDF keyword extraction algorithm, which assigns higher priority weights to minority-class keywords during selection so that their feature information is not lost and classification accuracy improves. The proposed algorithm performs well on a variety of imbalanced datasets; in an experiment on an imbalanced binary classification dataset built from THUCNews, its F1 value is 2.56% higher than that of the CHI algorithm. For the loss function, Focal Loss, originally applied in the image domain, is applied to text classification and its hyperparameters are tuned, which improves classification performance on imbalanced datasets to some extent. Experimental results show that the improved method raises the recognition rate of minority classes in both binary and multi-class classification; on the imbalanced binary classification dataset built from THUCNews, the macro F1 value (the average of the per-class F1 values) increases by 2.55%.
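The two quantities named at the end of the abstract, Focal Loss and macro F1, can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses the commonly cited form of Focal Loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), and the function names, the default gamma = 2.0, and the optional per-class weights `alpha` are assumptions made for illustration. The abstract does not state which hyperparameter values were selected.

```python
import math


def focal_loss(probs, label, gamma=2.0, alpha=None):
    """Focal loss for a single sample (illustrative sketch).

    probs: predicted class probabilities for the sample
    label: index of the true class
    gamma: focusing parameter (assumed default; gamma = 0 gives cross-entropy)
    alpha: optional per-class weights (assumed; e.g. larger for minority classes)
    """
    # Clamp to avoid log(0) when the model assigns zero probability.
    p_t = min(max(probs[label], 1e-12), 1.0)
    weight = 1.0 if alpha is None else alpha[label]
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 values."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # An easy sample (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.3).
    print(focal_loss([0.9, 0.1], 0), focal_loss([0.3, 0.7], 0))
    # A classifier that ignores the minority class scores poorly on macro F1.
    print(macro_f1([0, 0, 0, 1], [0, 0, 0, 0]))
```

Because macro F1 averages the per-class scores without weighting by class size, a model that ignores the minority class is penalized heavily, which is why the metric suits the imbalanced-data experiments described above.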

Keywords/Search Tags:

Text classification, feature word extraction, Convolutional Neural Network, unbalanced data set, Focal Loss

Related items

1	Research On News Text Classification Based On Convolutional Neural Network
2	Research And Implementation Of Text Classification Algorithm Based On Three-way Decision And Convolution Neural Network
3	Research On Scene Text Detection And Image Classification Based On Convolutional Neural Network
4	Research On Application Of Deep Convolutional Neural Network Models For Feature Extraction And Classification
5	Research On Text Classification Based On Word Sense Disambiguation And Convolutional Neural Network
6	Chinese Short Text Classification Based On Convolutional Neural Network Combined With Word Vector
7	Research On Semantic Feature Based Text Classification Algorithm
8	Research Of Short-text Classification Method Based On Convolution Neural Network
9	Research And Implementation Of Text Sentiment Analysis System Based On Neural Network Model
10	Research On Improvement Of Chi-square Feature Selection And Word Vector Text Representation For News Classification