
Sentiment drift and its effect on the classification of Web log posts

Posted on: 2009-09-19
Degree: Ph.D
Type: Thesis
University: Harvard University
Candidate: Durant, Kathleen T
Full Text: PDF
GTID: 2448390005958938
Subject: Computer Science
Abstract/Summary:
Sentiment classification separates a collection of opinionated text into two opposing classes: favorable and unfavorable. It has been successfully applied to online product comments and movie reviews. Previous studies have shown that topic, domain, and time influence the results of machine learning models used to classify sentiment. This thesis furthers the investigation of the effect of time on sentiment classification. It defines the phenomenon of sentiment drift: the change of sentiment over time. We create a topic-specific corpus and demonstrate a change in sentiment over specific time periods. The source of the corpus is web logs; we find it to be more difficult to classify than previously studied corpora.

Previous work has shown that factors such as machine learning induction technique, class composition, dataset size, and feature selection all influence predictability. We show that models with configurations that maximize predictability under these factors are still influenced by time. The most successful configuration we found is a collection of Naive Bayes models with applied feature selection and a balanced class composition. On average, the collection correctly predicts the sentiment of a web log post 89.77% of the time.

We perform collections of sentiment classification experiments that vary the difference (in months) between the testing period and the training period, which we call the testing-training difference (TTD). We show that as the TTD increases, the predictability of the sentiment model decreases. Models trained on months chronologically closer to the testing month produce significantly higher accuracies. We also show that models trained on future data significantly outperform models trained on past data. We investigate statistical subsets of the models and show that each subset is influenced by the TTD.

We show that models that incorporate the influence of time produce higher predictability. We find, for example, that ensemble models that define a weight based on the TTD produce higher predictability than those that do not ([2.176, 5.092] alpha-level .05). The findings show that 3-month ensembles outperform 5-month ensembles ([.39 alpha-level .05]), indicating that component models created more than three months from the testing examples decrease the results of an ensemble.
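
The TTD and a TTD-weighted ensemble vote can be illustrated with a minimal Python sketch. The abstract does not state the weighting formula used in the thesis, so the inverse-distance weight below is an assumption chosen only for illustration. The function names, the `(year, month)` period encoding, the `max_ttd` cutoff, and the sample predictions are all hypothetical.

```python
from collections import defaultdict


def ttd_months(train_period, test_period):
    """Signed testing-training difference in months.

    Periods are (year, month) tuples. A positive value means the model
    was trained on data older than the test month; a negative value
    means it was trained on "future" data relative to the test month.
    """
    train_year, train_month = train_period
    test_year, test_month = test_period
    return (test_year - train_year) * 12 + (test_month - train_month)


def ttd_weight(ttd):
    """Hypothetical weight that decays with the absolute TTD.

    This is an assumption for illustration, not the thesis's formula.
    """
    return 1.0 / (1.0 + abs(ttd))


def weighted_vote(predictions, test_period, max_ttd=None):
    """Combine per-month model predictions by TTD-weighted voting.

    predictions maps a training period (year, month) to the label that
    month's model predicted for a single post. When max_ttd is given,
    models with |TTD| greater than max_ttd are left out of the vote.
    """
    scores = defaultdict(float)
    for train_period, label in predictions.items():
        ttd = ttd_months(train_period, test_period)
        if max_ttd is not None and abs(ttd) > max_ttd:
            continue
        scores[label] += ttd_weight(ttd)
    if not scores:
        raise ValueError("no component model within max_ttd")
    return max(scores, key=scores.get)


# Made-up predictions for one post, keyed by each model's training month.
predictions = {
    (2005, 1): "unfavorable",
    (2005, 2): "unfavorable",
    (2005, 3): "unfavorable",
    (2005, 5): "favorable",
    (2005, 6): "favorable",
}
result = weighted_vote(predictions, test_period=(2005, 7))
```

In this made-up example an unweighted majority vote would pick "unfavorable" (three models to two), while the TTD-weighted vote favors the two models trained closest to the test month. The `max_ttd` cutoff loosely mirrors the idea of excluding component models trained far from the testing examples; it is not a reconstruction of the thesis's 3-month and 5-month ensembles.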
Keywords/Search Tags: Sentiment, Classification, Models, Web, TTD