
Research On Group Classification Technology Based On Chat Content

Posted on: 2021-05-07 | Degree: Master | Type: Thesis
Country: China | Candidate: X Feng | Full Text: PDF
GTID: 2428330629951027 | Subject: Signal and Information Processing
Abstract/Summary:
With the rise of social media, short-text chat on communication platforms such as QQ and WeChat has become very popular. Inferring topics from large volumes of short chat messages and classifying them accurately is a critical and challenging step for many content analysis tasks. Traditional short text classification builds its model from feature words, which ignores the relationships between words; and because short texts are sparse, approaches based on word frequency statistics and the vector space model classify them inaccurately. A common remedy for the semantic sparseness of short text is to draw on an external corpus, but the wide range of topics in a huge external corpus leads to poor performance when the algorithm is run. Commonly used machine learning classifiers such as SVM (Support Vector Machine) and Naive Bayes also need improvement in the short text domain: they require a large amount of data, and their accuracy is unsatisfactory.

In response to these shortcomings, this thesis improves the BTM (Biterm Topic Model) and proposes the TTR-BTM (Time TextRank Biterm Topic Model). The model introduces a time impact factor and uses time windows to select the more valuable data: TextRank keywords are extracted from each message in the time window, and the extracted keywords serve as the input to BTM, which outputs the distribution of topic words. Group classification is then performed on this basis. The data are first labeled according to the topic word distribution to prepare the training and validation sets, and a classification algorithm is then introduced. Starting from the output of TTR-BTM, the topic words are extended with feature words using Word2vec, and W-FastText (Word2vec-FastText), an improved FastText algorithm, is applied. Averaging the word vectors of the input sequence makes the terms more discriminative, which allows the short chat texts to be classified accurately.

Finally, the thesis compares TTR-BTM with the BTM and LDA topic models in terms of topic clustering quality, topic continuity, and perplexity. W-FastText is likewise compared with FastText, SVM, and Naive Bayes in terms of classification accuracy and model training time. The experimental results show that the proposed TTR-BTM topic model and W-FastText classification algorithm achieve better topic extraction and group classification, respectively.
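The preprocessing that the abstract describes for TTR-BTM (select messages by time window, extract TextRank keywords from each message, pass the keywords on to BTM) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the `(timestamp, tokens)` message format, the co-occurrence span, and the damping factor are all assumptions made for illustration, and the time impact factor and the BTM inference step themselves are not shown.

```python
from collections import defaultdict
from itertools import combinations


def in_window(messages, start, end):
    """Keep the token lists of messages whose timestamp lies in [start, end).

    `messages` is an assumed format: a list of (timestamp, tokens) pairs.
    """
    return [tokens for ts, tokens in messages if start <= ts < end]


def textrank_keywords(tokens, top_k=3, span=2, damping=0.85, iterations=30):
    """Rank words with PageRank over an undirected co-occurrence graph."""
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the next `span` words that follow it.
        for other in tokens[i + 1 : i + 1 + span]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    if not neighbours:
        # No co-occurrences (e.g. a one-word message): fall back to the words.
        return sorted(set(tokens))[:top_k]
    score = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        score = {
            w: (1 - damping)
            + damping * sum(score[n] / len(neighbours[n]) for n in neighbours[w])
            for w in neighbours
        }
    # Highest score first; ties are broken alphabetically.
    return sorted(score, key=lambda w: (-score[w], w))[:top_k]


def biterms(keywords):
    """Unordered word pairs, the unit a biterm topic model is trained on."""
    return [tuple(sorted(pair)) for pair in combinations(keywords, 2)]


if __name__ == "__main__":
    # Hypothetical chat messages: (timestamp in seconds, tokenized text).
    chat = [
        (100, ["football", "match", "tonight", "football", "ticket"]),
        (900, ["stock", "price"]),
    ]
    for tokens in in_window(chat, 0, 500):
        keywords = textrank_keywords(tokens, top_k=2)
        print(keywords, biterms(keywords))
```

In this sketch only the first message falls inside the window, so only its keywords would reach the topic model. A real pipeline would also need word segmentation and stop-word removal for Chinese chat text before the ranking step.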
Keywords/Search Tags: FastText, TextRank, BTM, short text classification, Word2vec