Research On Group Classification Technology Based On Chat Content

Posted on:2021-05-07

Degree:Master

Type:Thesis

Country:China

Candidate:X Feng

Full Text:PDF

GTID:2428330629951027

Subject:Signal and Information Processing

Abstract/Summary:

With the rise of social media, short text chat on communication platforms such as QQ and WeChat has become very popular. Inferring topics from large volumes of short chat messages and classifying them accurately is a critical and challenging step in many content analysis tasks. Traditional short text classification builds its model from feature words, which ignores the relationships between words; and because short texts are sparse, traditional word frequency statistics and the vector space model yield inaccurate classification results. A common remedy for the semantic sparseness of short text is to bring in an external corpus, but the very large set of topics in such a corpus leads to poor performance when the algorithm is run. Commonly used machine learning classifiers such as SVM (Support Vector Machine) and Naive Bayes also leave room for improvement on short text: they require a large amount of data, and their accuracy is unsatisfactory.

To address these shortcomings, this thesis improves the BTM (Biterm Topic Model) and proposes the TTR-BTM (Time TextRank Biterm Topic Model) topic model. The model introduces a time impact factor and uses time windows to select the more valuable data: for each message in the time window, keywords are extracted with TextRank, and the extracted keywords serve as the input to the BTM topic model, which outputs the topic-word distribution.

Group classification builds on this output. The data are first labeled according to the topic-word distribution to prepare the training and validation sets, and a classification algorithm is then introduced. The topic words output by TTR-BTM are expanded with feature words using Word2vec, and W-FastText (Word2vec-FastText), an improved FastText algorithm, is applied. By averaging the word vectors of the input sequence, the method makes terms more discriminative and classifies the short chat texts accurately.

Finally, the thesis compares the proposed TTR-BTM with the BTM and LDA topic models in terms of topic clustering quality, topic continuity, and perplexity. W-FastText is likewise compared with FastText, SVM, and Naive Bayes in terms of classification accuracy and model training time. The experimental results show that the proposed TTR-BTM topic model and W-FastText classification algorithm achieve better topic extraction and group classification, respectively.
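
The preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: all function names are hypothetical, the TextRank variant is a simplified, unweighted PageRank over a word co-occurrence graph, and messages are assumed to arrive as already tokenized (timestamp, tokens) pairs. BTM inference, the time impact factor, Word2vec expansion, and the FastText classifier itself are left out; the sketch only shows window filtering, keyword ranking, biterm generation, and word-vector averaging.

```python
from collections import defaultdict
from itertools import combinations


def in_window(messages, start, end):
    """Keep the token lists of messages whose timestamp is in [start, end)."""
    return [tokens for ts, tokens in messages if start <= ts < end]


def textrank_keywords(tokens, top_k=3, span=2, damping=0.85, iters=50):
    """Rank words by PageRank on an undirected co-occurrence graph (simplified TextRank)."""
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        # Link each word to the next `span` words in the message.
        for v in tokens[i + 1 : i + 1 + span]:
            if v != w:
                neighbors[w].add(v)
                neighbors[v].add(w)
    if not neighbors:
        # No co-occurrences to rank: fall back to the distinct tokens.
        return sorted(set(tokens))[:top_k]
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping)
            + damping * sum(score[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(score, key=lambda w: (-score[w], w))[:top_k]


def biterms(keywords):
    """All unordered word pairs from one message's keywords (the unit BTM models)."""
    return list(combinations(sorted(set(keywords)), 2))


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens found in `vectors` (assumed word -> list of floats)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]
```

In this sketch, each message in the window would pass through `textrank_keywords`, and the resulting keyword lists through `biterms`, before being handed to a BTM implementation; `mean_vector` stands in for the averaged text representation that a FastText-style classifier consumes.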

Keywords/Search Tags:

FastText, TextRank, BTM, short text classification, Word2vec

Related items

1	Research On Short Text Automatic Summarization Algorithm Based On TextRank And Word2Vec
2	Research On The Method And Its Application Of Short Text Classification Based On FastText
3	Research On Chinese Short Text Classification Based On Improved FastText
4	Research On Chinese Text Classification Based On Improved FastText
5	Research On FastText Text Classification Algorithm Based On TF-IDF
6	Research On Chinese Short Text Classification Based On Word Embedding
7	Research On Emotional Classification Based On Short Text(Sentence Level)
8	Research On Short Text Emotion Classification Method Based On Word2Vec And N-Gram
9	Chinese Short Text Analysis Based On Word2vec
10	Research On Fast And Precise Classification Algorithm Of Long Text Based On FastText