
Combining Topic Model And Word Embedding For Short-Text Classification

Posted on: 2020-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Shao
Full Text: PDF
GTID: 2428330602452147
Subject: Information Science
Abstract/Summary:
With the continued development of Internet technology, and especially the spread of mobile devices, the way people learn and live keeps changing, and the volume of data generated on the Internet has grown rapidly. To suit the fragmented usage scenarios of mobile terminals, short text such as online news and product reviews has become the main form of text data on the Internet. Faced with massive amounts of short text, effective classification not only reduces the amount of data to be handled and supports an accurate understanding of its content, but also plays a vital role in fields such as news recommendation and public opinion monitoring. Short text contains few words and carries sparse information, so traditional classification methods designed for long text struggle to achieve good results on it directly.

To solve this problem, this thesis proposes a short-text classification method that combines a topic model with a word vector model. An improved TF-IDF model and a word vector model are used to construct a keyword set for each category, and this keyword set is used to judge how well each candidate expansion word discriminates between categories. The content of a short text is then expanded by computing the cosine similarity between word vectors. An LDA model is used to construct the topic distribution set of each category and to extend the vector representation of the words in a short text at topic granularity. Because category features are incorporated into the expansion, the proposed method avoids ineffective expansion and makes short-text expansion more effective. In the classification stage, the thesis improves the text classification method based on the deep learning network TextCNN: the weights of the convolutional feature maps are modeled to strengthen the network's ability to capture short-text features. Experiments show that the proposed method improves the accuracy and recall of text classification on short-text datasets of different lengths.

To address the low accuracy and sparse text representation encountered in semi-supervised short-text classification, the thesis applies the short-text content expansion method and the improved word vector representation under semi-supervised conditions, and optimizes the self-training and collaborative training classification methods. Experiments show that the proposed content expansion method and the improved word vectors have a positive effect on semi-supervised classification and yield better short-text classification results.
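
The expansion step described in the abstract (adding words to a short text by cosine similarity of word vectors, restricted to a category keyword set) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the similarity threshold, the `top_k` limit, and the toy three-dimensional embeddings are all assumptions made for illustration, and the category keyword set is taken as given rather than built with the improved TF-IDF model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_short_text(tokens, embeddings, category_keywords, threshold=0.8, top_k=2):
    """Append category keywords whose vectors are close to a token of the text.

    threshold and top_k are hypothetical tuning knobs, not values from the thesis.
    Only words from category_keywords are eligible, which is the sketch's stand-in
    for filtering out expansion words that do not indicate a category.
    """
    expanded = list(tokens)
    seen = set(tokens)
    for token in tokens:
        if token not in embeddings:
            continue  # out-of-vocabulary tokens cannot be compared
        scored = []
        for candidate in category_keywords:
            if candidate in seen or candidate not in embeddings:
                continue
            sim = cosine(embeddings[token], embeddings[candidate])
            if sim >= threshold:
                scored.append((sim, candidate))
        scored.sort(reverse=True)  # most similar candidates first
        for _, candidate in scored[:top_k]:
            expanded.append(candidate)
            seen.add(candidate)
    return expanded


# Toy embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "phone": [0.9, 0.1, 0.0],
    "battery": [0.8, 0.2, 0.1],
    "screen": [0.85, 0.15, 0.05],
    "goal": [0.1, 0.0, 0.95],
}
TECH_KEYWORDS = {"battery", "screen"}
SPORTS_KEYWORDS = {"goal"}

print(expand_short_text(["phone", "great"], TOY_EMBEDDINGS, TECH_KEYWORDS))
print(expand_short_text(["phone", "great"], TOY_EMBEDDINGS, SPORTS_KEYWORDS))
```

With the technology keyword set, the text gains the nearby words "screen" and "battery"; with the sports keyword set nothing is added, because "goal" falls below the threshold. Restricting candidates to a category keyword set is what keeps unrelated words from being appended in this sketch.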
Keywords/Search Tags: Short-text Classification, LDA, Word Embedding, Deep Learning, Semi-supervised Learning