Research On Improved TF-IDF Feature Selection And Short Text Classification Algorithm

Posted on: 2021-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhou
Full Text: PDF
GTID: 2428330620965585
Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of Internet technology, the Internet has become an important platform for users to obtain information, communicate, and learn. These platforms also generate large amounts of text data. Such texts are short, semantically related, and varied in expression, yet they carry a great deal of information. How to process these short texts and extract valuable information from them has long been a concern. Text classification assigns a text to one or more categories; it can bring order to messy, unorganized short texts, improve information utilization, and help users narrow the scope of information retrieval. Because these text data are unstructured, applying traditional feature representation methods and classification models to them directly yields limited accuracy. This thesis therefore addresses two aspects: the text feature selection method and the text classification algorithm.

1. To address the imbalance of short text data sets and the poor fit of traditional feature selection methods, this thesis first introduces quasi-frequency variance and the chi-square statistic (CHI) into the TF-IDF algorithm, forming two single-model feature selection algorithms. Merging the two single models and introducing Word2vec yields WoTFI, which is used for feature acquisition. The model takes into account both the semantic information of the text data and the differences in the distribution of feature words within and between classes. Compared with other feature representation models, WoTFI assigns weights to feature words flexibly and also has a positive effect on the classification results.

2. The traditional classification algorithm is improved by combining Bi-LSTM with a CNN that takes dual-channel feature input. WoTFI serves as one input channel of the model, and the other channel is a character-level feature embedding representation, which captures the shape and morphological information of words or phrases to obtain short text features. The CNN then processes the features of both channels to extract deeper features. LRN optimization and a Dropout strategy are introduced in the pooling layer and the LSTM layer to speed up the supervised learning algorithm, prevent over-fitting, and improve generalization. The classification model combines the advantages of CNN and Bi-LSTM: it captures bidirectional semantic dependencies, effectively retains the semantic information of short texts, and avoids the gradient explosion and vanishing problems that arise when training on long sequences.

The experimental data sets differ in size and in number of categories and include both Chinese and English text. The comparison experiments show that the model proposed in this thesis outperforms the traditional models on the performance indices.
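
As an illustration of point 1, the idea of introducing CHI into TF-IDF can be sketched as below. This is a minimal sketch under stated assumptions, not the thesis's WoTFI formula, which the abstract does not give: it assumes the chi-square score of a term is simply multiplied into a smoothed TF-IDF weight, and it omits the quasi-frequency variance and Word2vec components. The function names `chi_square` and `tfidf_chi_weights` are hypothetical.

```python
import math
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class contingency table.

    n11: docs in the class containing the term
    n10: docs outside the class containing the term
    n01: docs in the class without the term
    n00: docs outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def tfidf_chi_weights(docs, labels):
    """Return one dict per document mapping term -> TF * IDF * CHI.

    Assumption: CHI is taken as the maximum over classes and used as a
    plain multiplicative factor on a smoothed TF-IDF weight.
    """
    n_docs = len(docs)
    classes = sorted(set(labels))
    class_size = Counter(labels)
    df = Counter()                                # document frequency per term
    df_class = {c: Counter() for c in classes}    # per-class document frequency
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_class[label][term] += 1

    chi = {}
    for term, d in df.items():
        scores = []
        for c in classes:
            n11 = df_class[c][term]
            n10 = d - n11
            n01 = class_size[c] - n11
            n00 = n_docs - n11 - n10 - n01
            scores.append(chi_square(n11, n10, n01, n00))
        chi[term] = max(scores)

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            * chi[term]
            for term, count in tf.items()
        })
    return weighted


# Toy corpus (invented for illustration only).
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "now"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting", "today"],
]
labels = ["spam", "spam", "ham", "ham"]
weights = tfidf_chi_weights(docs, labels)
```

In this toy corpus, "cheap" appears in every spam document and in no ham document, so its chi-square factor is larger than that of "pills", which appears in only one spam document. The class-dependent factor is what lets the weight reflect how a term is distributed between classes, which plain TF-IDF ignores.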
Keywords/Search Tags: short text classification, improved TF-IDF algorithm, convolutional neural network, Word2vec, Bi-LSTM