Research On A Text Classification Method Based On The Concatenation Of Word Vectors And Doc2vec

Posted on: 2022-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: L M S A B L M T A
GTID: 2518306353467944
Subject: Master of Engineering
Abstract/Summary:
In the era of big data, the volume of text data has grown dramatically. Efficient querying and effective management of text have become particularly important, and text needs to be organized by category. Class imbalance becomes more pronounced as the amount of data grows, resulting in a large gap between the number and types of training samples in different categories. A category with few samples lacks adequate word feature representation, so minority-category samples tend to be predicted as belonging to the majority category, which greatly degrades classification performance. Because deep learning can learn data features automatically and perform classification, it has gradually become the mainstream research method in natural language processing, but each model and method still describes text content incompletely and overlooks key information. Since classifier models differ in structure, a suitable classifier must also be selected. This thesis therefore studies two aspects: feature processing of text content and classifier selection.

For feature processing, in order to capture word order information and sentence structure information, this thesis designs a vector concatenation model that fuses TF-IDF and Doc2vec. The paragraph vectors and word vectors obtained by training the Doc2vec model represent the overall content of the text, while the TF-IDF model extracts keywords that represent the key information of different text categories. The text representation obtained by concatenating and fusing the Doc2vec and TF-IDF outputs therefore covers more comprehensive textual information.

For classifier selection, the bidirectional gated recurrent unit (BiGRU) is used as the base classifier in the text classification task. It extracts features from the input text features in both the forward and backward directions, so that the text semantics can be understood effectively. An attention mechanism is introduced to strengthen the weights of key features, which addresses the time-series averaging and feature burying problems of the bidirectional gated recurrent unit.

Experimental comparison shows that the overall classification accuracy of the proposed model reaches 91.2%, which is 13.8% higher than that of the baseline using TF-IDF text vectors with CNN and SVM as classifiers. The proposed algorithm also improves results on categories with few samples: for the smallest category, accuracy increases by 10.2% and recall by 5.6%. Overall, the proposed model achieves good classification results.
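
The fusion of a document vector with TF-IDF keyword weights described in the abstract could be sketched roughly as follows. The abstract does not say how the two representations are joined, so this is only an illustration under stated assumptions: plain concatenation, L2 normalization of each part, a smoothed IDF formula, and hand-written placeholder vectors standing in for trained Doc2vec output. The helper names (`tfidf_vectors`, `l2_normalize`, `concatenate_features`) are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption; the thesis may use a different variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def concatenate_features(doc_vec, tfidf_vec):
    """Join a dense document vector and a TF-IDF vector into one feature vector.

    Normalizing each part first (an assumption) keeps either part from
    dominating purely because of its scale.
    """
    return l2_normalize(doc_vec) + l2_normalize(tfidf_vec)


# Toy corpus of already-tokenized documents.
docs = [
    ["stock", "market", "falls"],
    ["team", "wins", "match"],
    ["market", "rally", "continues"],
]

# Placeholder paragraph vectors standing in for Doc2vec output.
# These values are invented for illustration only.
doc_vecs = [
    [0.2, -0.1, 0.4, 0.3],
    [-0.3, 0.5, 0.1, -0.2],
    [0.1, -0.2, 0.5, 0.2],
]

vocab, tfidf = tfidf_vectors(docs)
features = [concatenate_features(d, t) for d, t in zip(doc_vecs, tfidf)]

# Each fused vector has len(doc_vec) + len(vocab) dimensions.
print(len(vocab), len(features[0]))
```

In a fuller pipeline, the placeholder vectors would come from a trained paragraph-vector model, and the fused vectors would be fed to the classifier (in the thesis, a BiGRU with attention).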
Keywords/Search Tags: text classification, word vector, Doc2vec, BiGRU, TF-IDF, word concatenation