Topic Categorization Of Short Text Sequences

Posted on: 2020-11-23   Degree: Master   Type: Thesis
Country: China   Candidate: W Y Shen   Full Text: PDF
GTID: 2428330596968996   Subject: Public Security Technology
Abstract/Summary:
To address the information sparseness and topic variability of short texts, and to improve the performance of topic categorization of short text sequences, this thesis takes short text sequences as its research object and proposes approaches for the key steps of topic categorization, drawing on relevant machine learning and natural language processing techniques. The main work and conclusions of this thesis are as follows.

To address the sparseness of short text sequences, this thesis draws on the idea of bootstrap sampling in statistics and proposes a semantic-distance-based data augmentation method for short text sequences. The method expands the features of short texts by computing text similarity and word-word distance. Compared with several other text data augmentation methods, the proposed method yields higher topic classification accuracy. This shows that, without external applications or knowledge, the corpus itself can be used to augment short texts effectively, allowing the classifier to learn more features and improving the generalization of the models.

To address the open-ended nature of short text topics, this thesis proposes an out-of-domain topic detection method based on an Autoencoder. The method feeds the representation vector of a short text, pre-trained in the classification task, into the Autoencoder, and uses the value of the reconstruction loss to detect short texts on out-of-domain topics. Experiments demonstrate that the proposed algorithm outperforms several one-class classification algorithms, and show that the Autoencoder can be used not only for anomaly detection in image processing and video surveillance but also for detecting out-of-domain topic text.

To address the problem of short text modeling, this thesis proposes CapSA, a neural network text modeling method that combines a deep capsule network with a self-attention network. The method achieves equivariance of feature changes, preserves the dependence between long-distance words, and increases the diversity of features. Compared with several neural network models based on RNN and CNN, experimental results show that the proposed CapSA-based topic categorization model performs better on topic categorization of short text sequences.

Finally, the thesis constructs a prototype system for topic categorization of short text sequences and, through implementation practice, verifies the applicability of the proposed approaches and steps to content censorship.
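
As a rough illustration of the reconstruction-loss decision rule described in the abstract, the following Python sketch flags a representation vector as out-of-domain when its reconstruction error exceeds a threshold calibrated on in-domain data. It is not the thesis's implementation: the function names, the quantile-based threshold, and the `toy_reconstruct` stand-in (a fixed linear projection playing the role of a trained Autoencoder) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Reconstructor = Callable[[Vector], Vector]


def reconstruction_loss(x: Vector, reconstruct: Reconstructor) -> float:
    """Mean squared error between a vector and its reconstruction."""
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def fit_threshold(in_domain: List[Vector], reconstruct: Reconstructor,
                  quantile: float = 0.95) -> float:
    """Pick a loss threshold from in-domain data (assumed calibration rule)."""
    losses = sorted(reconstruction_loss(x, reconstruct) for x in in_domain)
    index = min(len(losses) - 1, int(quantile * len(losses)))
    return losses[index]


def is_out_of_domain(x: Vector, reconstruct: Reconstructor,
                     threshold: float) -> bool:
    """Flag a vector whose reconstruction loss exceeds the threshold."""
    return reconstruction_loss(x, reconstruct) > threshold


def toy_reconstruct(x: Vector) -> Vector:
    """Hypothetical stand-in for a trained Autoencoder: a 1-D bottleneck
    that keeps only the first coordinate and decodes the rest as zero."""
    return [x[0], 0.0]


# Toy in-domain representation vectors that lie close to the first axis.
in_domain = [[1.0, 0.1], [2.0, -0.1], [0.5, 0.05], [3.0, 0.0]]
threshold = fit_threshold(in_domain, toy_reconstruct)

print(is_out_of_domain([2.5, 0.05], toy_reconstruct, threshold))  # near the axis
print(is_out_of_domain([1.0, 2.0], toy_reconstruct, threshold))   # far off the axis
```

In practice the reconstructor would be a trained Autoencoder applied to the classifier's pre-trained text representations, and the threshold would be tuned on held-out data. The quantile used here is only one plausible choice.
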
Keywords/Search Tags: short text, topic categorization, data augmentation, out-of-domain detection