Research On Text Classification Based On Multi-factor Features

Posted on: 2020-11-08  Degree: Master  Type: Thesis
Country: China  Candidate: W Lu  Full Text: PDF
GTID: 2428330596976499  Subject: Engineering
Abstract/Summary:
Traditional methods for classifying e-commerce reviews rely on bag-of-words models or simple TF-IDF features, and classify the review text with an SVM or another traditional machine learning model. In recent years, attempts have been made to use static word vectors such as Word2Vec embeddings as the sole representation of the text, which is then classified by a neural network model such as an LSTM. Although these approaches are effective to a degree, they still cannot meet the accuracy requirements of users and merchants. This thesis explores several word vector representations and classification models, and proposes and implements the following ideas:
(1) Starting from the non-text features, the TF-IDF features extracted from the review text are attached to them as additional features, and the combined features are classified with a LightGBM model and a logistic regression model. The LightGBM model works better and becomes an important part of the multi-factor feature model.
(2) Instead of representing words with a single type of word vector, as earlier work does, both Word2Vec and GloVe word vectors are used to represent the text, and the mixed word vectors are fed into the same Chinese text classification model. Compared with using a single word vector, the F1 value and ROC value increase by about 1.7%.
(3) With static word vectors, each word corresponds to one fixed vector, which causes ambiguity for words with several meanings. This thesis uses the ELMo language model to generate dynamic word vectors, which map the same word to different vectors depending on its context. Experiments verify the advantage of ELMo dynamic word vectors over static word vectors: they increase the F1 and ROC values by about 1%.
(4) The Transformer model is used as the text classification model and its classification performance is compared with that of an LSTM. For this comparison, Position Encoding information is added to the word vectors fed to the LSTM. The results demonstrate the validity of the Transformer model for classification.
Finally, a text classification model based on multi-factor features is proposed. One branch combines the TF-IDF features built from the review text with the non-text information and classifies them with a LightGBM model. The other branch uses the Transformer as the classifier on top of dynamic word vectors generated by ELMo. The two models are then fused to form the text classification model based on multi-factor features. Experiments confirm the accuracy of this model: its F1 value and ROC value reach 0.94 or above.
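The abstract does not state how the word vectors are mixed, which position encoding is used, or how the two models are fused. The following Python sketch illustrates one plausible reading of each step. It assumes concatenation for mixing Word2Vec and GloVe vectors, the standard sinusoidal position encoding, and a weighted average of predicted probabilities for fusion. All function names and the `weight` parameter are hypothetical and are not taken from the thesis.

```python
import math


def mix_vectors(w2v_vec, glove_vec):
    """Mix two static embeddings of one word by concatenation (assumed scheme)."""
    return list(w2v_vec) + list(glove_vec)


def positional_encoding(position, dim):
    """Sinusoidal position encoding: sine on even indices, cosine on odd ones."""
    encoding = []
    for i in range(dim):
        # Each sine/cosine pair shares one frequency.
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_positions(token_vectors):
    """Add the position encoding to every token vector of a sentence."""
    encoded = []
    for position, vector in enumerate(token_vectors):
        pe = positional_encoding(position, len(vector))
        encoded.append([v + p for v, p in zip(vector, pe)])
    return encoded


def fuse(p_feature_model, p_text_model, weight=0.5):
    """Fuse two predicted probabilities by weighted averaging (assumed scheme)."""
    return weight * p_feature_model + (1 - weight) * p_text_model


# Toy 2-dimensional vectors stand in for real Word2Vec and GloVe embeddings.
sentence = [mix_vectors([0.1, 0.2], [0.3, 0.4]),
            mix_vectors([0.5, 0.6], [0.7, 0.8])]
sentence_with_positions = add_positions(sentence)

# Hypothetical outputs of the LightGBM branch and the Transformer branch.
fused_probability = fuse(0.8, 0.6)
```

In this sketch the fusion weight is a free parameter that would have to be tuned on validation data; the fusion actually used in the thesis may differ, for example stacking the two models instead of averaging them.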
Keywords/Search Tags: Text Classification, Word Embedding, Word2Vec, Elmo, Transformer, Model Fusion