Research On Text Sentiment Classification Based On Feature Selection And TFIDF

Posted on: 2021-06-13 | Degree: Master | Type: Thesis
Country: China | Candidate: H Zeng | Full Text: PDF
GTID: 2518306575953679 | Subject: Software engineering
Abstract/Summary:
The development of the Internet has brought new opportunities and challenges to all walks of life, and it has given users all over the world a platform for exchanging information. People are increasingly used to expressing their opinions and comments online. With the continuous and rapid development of Internet applications, massive amounts of text are stored and circulated on the network, including product reviews, personal microblogs, and discussions of hot public issues. These texts contain valuable information. Because manual sorting can no longer keep up with the rapidly growing volume of text, how to extract valuable information efficiently and accurately has become a new research topic, and sentiment classification is one of its typical problems.

This thesis uses machine learning to study sentiment classification of Chinese text. First, the text data is preprocessed, which includes text cleaning, word segmentation, and stop-word removal. Since raw text cannot be used directly for classification, it must be converted into structured data. The text representation used in this thesis is the Vector Space Model, which treats the words in the text as feature dimensions: all the words in the corpus span a vector space, and each text is represented as a feature vector. The large number of words in the corpus makes the text features high-dimensional and sparse, so the feature dimension must be reduced in order to reduce noise in the text data and improve the generalization ability of the classification models. A feature selection method based on information gain is used to extract, from the high-dimensional features, the subset that contributes most to classification, which reduces the feature dimension effectively. In addition to feature selection, the TFIDF method is used to weight the features, so that the importance of each feature is quantified and features can be distinguished from one another, which improves classification accuracy.

Finally, six comparative experiments were conducted by combining three feature-processing schemes with two classification models, Logistic Regression and Support Vector Machine. The results of the six experiments were evaluated and analyzed in terms of precision, recall, F1-score, and classification time. Feature selection and TFIDF weighting were found to reduce the gap in the classifier's predictive performance across categories and to improve the generalization ability of the model. The F1-scores of Logistic Regression and Support Vector Machine are 0.87 and 0.88, respectively, which is good performance and meets the experimental expectations.
Keywords/Search Tags: Feature selection, Information gain, Sentiment classification, Machine learning
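
The feature-processing steps described in the abstract (information-gain feature selection followed by TFIDF weighting) can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy corpus, the function names, the use of term presence for information gain, and the unsmoothed IDF formula log(N/df) are all assumptions made for illustration, since the abstract does not state the exact variants used. Word segmentation and stop-word removal are assumed to have been done already, and the classifier step (Logistic Regression or Support Vector Machine) is omitted.

```python
import math
from collections import Counter

# Hypothetical, already-segmented toy corpus: (tokens, label), 1 = positive, 0 = negative.
docs = [
    (["good", "quality", "fast", "delivery"], 1),
    (["good", "price", "good", "service"], 1),
    (["bad", "quality", "slow", "delivery"], 0),
    (["bad", "service", "bad", "price"], 0),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term, docs):
    """IG(t) = H(C) - P(t) * H(C | t present) - P(not t) * H(C | t absent)."""
    n = len(docs)
    labels = [y for _, y in docs]
    with_t = [y for toks, y in docs if term in toks]
    without_t = [y for toks, y in docs if term not in toks]
    conditional = (len(with_t) / n) * entropy(with_t) \
        + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional

def select_features(docs, k):
    """Return the k terms with the highest information gain (ties broken alphabetically)."""
    vocab = sorted({t for toks, _ in docs for t in toks})
    ranked = sorted(vocab, key=lambda t: (-information_gain(t, docs), t))
    return ranked[:k]

def tfidf_vectors(docs, features):
    """Weight each selected feature by tf * idf, with tf = count / doc length
    and idf = ln(N / df). This IDF variant is an assumption."""
    n = len(docs)
    df = {t: sum(1 for toks, _ in docs if t in toks) for t in features}
    idf = {t: math.log(n / df[t]) for t in features}
    vectors = []
    for toks, _ in docs:
        counts = Counter(toks)
        vectors.append([counts[t] / len(toks) * idf[t] for t in features])
    return vectors

selected = select_features(docs, k=2)      # ["bad", "good"] on this toy corpus
vectors = tfidf_vectors(docs, selected)    # one weighted vector per document
print(selected)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

On this toy data, "good" and "bad" each split the classes perfectly and so receive the maximum information gain of 1 bit, while words such as "quality" that appear equally in both classes receive 0 and are dropped. The resulting weighted vectors are what a classifier would then be trained on.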