With the advent of the information age, Internet technology has matured and is now widely used in every area of social life. The rapid growth in Internet users has produced a large amount of short text data, which contains a great deal of valuable hidden information. How to classify this data and extract valuable information from it has gradually become a research hotspot in the field of NLP. Short text is characterized by few feature words and a large volume of data, so traditional text classification methods do not work well on short text classification tasks.

At present, short text classification mainly faces the following challenges: (1) Sparsity: the text is short and carries little valid information, which leads to sparse contextual features and makes it difficult for a model to obtain sufficient feature information. (2) Irregularity: short text on the Internet is highly colloquial. Typos, polysemy, and homophone substitution occur frequently, so short text data contains a lot of noise.

To overcome these difficulties and make efficient use of short text data, this thesis presents a comprehensive analysis of short text classification tasks and improves the short text classification model in three aspects (diversity, globality, and key information) to address the shortcomings of existing short text classification models. The main work is as follows:

(1) A capsule network model based on a multi-granularity attention mechanism, SMAC, is proposed to address the discrete and sparse nature of short text features. The model encodes short text through a stacked dilated convolution structure, builds multi-granularity semantic features of the short text hierarchically, and improves the model's ability to extract both local and global features of short text. To make efficient use of the extracted multi-granularity semantic features, a multi-granularity attention mechanism based on dilated convolution is designed, which captures the multi-scale features of short text through attention between features at different scales. Finally, the resulting multi-scale features are fed to the capsule network as the initial capsule layer for text classification. The validity of this method is verified on seven text classification datasets.

(2) To address the lack of context information and the irregular expression of short text, a short text classification algorithm based on the semantic expansion of keywords, KW-CNN, is proposed. The model expands the features of the keywords in the input text, and then uses different algorithms, including gated convolution and TextCNN, to process the keyword expansion feature matrix and the original text feature matrix respectively. Finally, the short text is modeled by a multi-channel convolutional neural network, which alleviates the sparse feature problem of short text. Experiments show that the KW-CNN model achieves the best results on all three datasets, which demonstrates the validity of the model.
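
As a rough illustration of the stacked dilated ("void") convolution idea described for SMAC, the following minimal sketch shows how stacking 1-D convolutions with increasing dilation yields features at several granularities and a growing receptive field. It is not the thesis implementation: the function names, the kernel `[1.0, 1.0, 1.0]`, and the dilation rates `(1, 2, 4)` are assumptions chosen only for illustration, and the attention and capsule layers are omitted.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in tokens) of a stack of 1-D dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def dilated_conv1d(seq, kernel, dilation):
    """Zero-padded ('same' length) 1-D dilated convolution over a list of floats."""
    center = len(kernel) // 2
    out = []
    for i in range(len(seq)):
        total = 0.0
        for j, w in enumerate(kernel):
            pos = i + (j - center) * dilation
            if 0 <= pos < len(seq):  # positions outside the sequence count as zero
                total += w * seq[pos]
        out.append(total)
    return out


def multi_granularity_features(seq, kernel, dilations):
    """Apply the dilated layers in sequence and keep every layer's output.

    Each entry of the returned list is one 'granularity': later layers
    see a wider span of the input than earlier ones.
    """
    feats = []
    x = seq
    for d in dilations:
        x = dilated_conv1d(x, kernel, d)
        feats.append(x)
    return feats


# Hypothetical toy input: a single "active" token in a 15-token sequence.
tokens = [0.0] * 15
tokens[7] = 1.0
features = multi_granularity_features(tokens, [1.0, 1.0, 1.0], (1, 2, 4))
spans = [sum(1 for v in layer if v != 0) for layer in features]
```

In this toy setting, `spans` records how many positions each layer's output is influenced by the single active token, which makes the widening context of each successive granularity visible. A real model would operate on embedding matrices with learned kernels rather than scalar sequences.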