Semi-supervised Learning And Active Learning Of Sentiment Classification Coupled With Domain Knowledge

Posted on: 2012-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Mai
GTID: 2218330371452383
Subject: Software engineering
Abstract/Summary:
As the internet evolves into Web 2.0, information on the internet is also shifting from traditional portal news to User Generated Content (UGC). Compared with traditional portal news, information diffusion in the Web 2.0 era possesses several distinctive characteristics: (1) an unprecedented number of topics are available, and the volume of information is higher than ever before; (2) the generation of information is interactive and continuous; (3) the reach of information can grow geometrically in a very short period of time. As a result of these dramatic changes in information generation and diffusion, internet users are now able to participate in various kinds of interactions and express thoughts and opinions freely, thus creating large amounts of text that carries sentiment.
Several characteristics of this kind of text matter for classification: (1) the input is high-dimensional; (2) the text representation is scattered, and far more visibly so than in texts used for traditional text classification, such as news feeds, blogs and email groups; (3) the features that can be linked to the class are significantly fewer than in traditional texts; (4) the text data are linearly separable or approximately linearly separable. Of these four characteristics, (2) and (3) make sentiment texts much more difficult to classify than traditional ones.
Current studies of sentiment text classification follow two main approaches: (1) the linguistic classification method; (2) the statistical learning method. The linguistic classification method borrows heuristic rules from linguistic studies, or linguistic tools developed by experts such as a semantic lexicon, to determine the positive or negative sentiment inclination of each word in a sentiment text, and then determines the overall sentiment inclination of the whole text by summation.
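The summation step of the linguistic classification method can be illustrated with a minimal sketch. The toy lexicon, its scores, the whitespace tokenizer, and the function name `lexicon_polarity` are all assumptions made for illustration; they are not taken from the thesis.

```python
# Hypothetical toy lexicon (not from the thesis): word -> polarity score,
# +1 for positive words and -1 for negative words.
TOY_LEXICON = {
    "good": 1, "great": 1, "love": 1,
    "bad": -1, "poor": -1, "hate": -1,
}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Classify a text by summing the polarity scores of its words.

    Words missing from the lexicon contribute 0. Tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    tokens = text.lower().split()
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A call such as `lexicon_polarity("I love this great phone")` sums two positive scores and returns `"positive"`. No labeled data is needed, which reflects the low labeling cost attributed to this approach.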
By contrast, the statistical learning method first labels the sentiment inclination of a certain number of texts, and then uses the labeled texts as training data to build models with supervised or semi-supervised learning techniques, which are then used to classify unlabeled text. The linguistic classification method requires less manpower for data labeling but falls short in precision when compared with the statistical learning method. The statistical method, although it requires more manpower and funding for the necessary labeling work, achieves more precise classification. Each method has its advantages, but current studies fail to capture both simultaneously.
This thesis proposes a new approach: first translate the semantic lexicon into a formalized prior model, then combine it with the statistical learning method to design supervised and semi-supervised learning techniques. This is called the Statistical Learning with Knowledge Coupling Method. By combining the linguistic classification method with the statistical learning method, the new method obtains two main advantages: (1) it reduces, to some extent, the labeled and unlabeled data required by the statistical learning method while still producing equally effective classification; (2) given the same amount of data, the classification effectiveness of the supervised and semi-supervised learning techniques can be further enhanced with the help of sentiment-oriented linguistic resources.
To realize the proposed method, this thesis first introduces how to transform a semantic lexicon into a prior model that can be accepted by the Naive Bayes model and by discriminative models (Support Vector Machine, Logistic Regression) (Chapter 3), and argues theoretically that such formalized linguistic knowledge is equivalent to a certain amount of training samples, thus providing the theoretical foundation for the supervised and semi-supervised learning techniques designed in later chapters. Secondly, it explains how to incorporate the linguistic knowledge from the formalized prior model into semi-supervised learning techniques to create a knowledge-coupling-based semi-supervised polarity text classification method (Chapter 4). It then explains how to create a knowledge-coupling-based active learning polarity text classification method using active learning techniques, again making use of the linguistic knowledge from the formalized prior model (Chapter 5). Lastly, it presents conclusions drawn from the study and raises issues worth further exploration in the future.
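One way to picture the claim that formalized lexicon knowledge is equivalent to a certain amount of training samples is to treat each lexicon entry as pseudo-counts in a multinomial Naive Bayes model. The sketch below illustrates that idea only; it is not the construction given in the thesis. The function names, the `strength` parameter (pseudo-occurrences per lexicon word), and the smoothing choices are assumptions.

```python
import math
from collections import Counter

CLASSES = ("pos", "neg")


def train_nb_with_lexicon(docs, labels, lexicon, strength=2.0, alpha=1.0):
    """Train a multinomial Naive Bayes model seeded with lexicon pseudo-counts.

    docs: list of token lists; labels: list of "pos"/"neg" strings;
    lexicon: dict mapping word -> "pos" or "neg" (hypothetical format);
    strength: assumed number of pseudo-occurrences per lexicon word;
    alpha: Laplace smoothing constant.
    """
    counts = {c: Counter() for c in CLASSES}
    # Prior knowledge: each lexicon word acts like `strength` observed
    # occurrences of that word in documents of its polarity class.
    for word, polarity in lexicon.items():
        counts[polarity][word] += strength
    # Labeled data is then added on top of the pseudo-counts.
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    label_counts = Counter(labels)
    return {
        "counts": counts,
        "totals": {c: sum(counts[c].values()) for c in CLASSES},
        "vocab": set(counts["pos"]) | set(counts["neg"]),
        "alpha": alpha,
        # Smoothed class prior so that an empty training set still works.
        "class_prior": {
            c: (label_counts[c] + 1) / (len(labels) + len(CLASSES))
            for c in CLASSES
        },
    }


def predict(model, tokens):
    """Return the class with the higher log-posterior for a token list."""
    vocab_size = len(model["vocab"])
    scores = {}
    for c in CLASSES:
        score = math.log(model["class_prior"][c])
        denom = model["totals"][c] + model["alpha"] * vocab_size
        for tok in tokens:
            if tok in model["vocab"]:  # out-of-vocabulary tokens are skipped
                score += math.log((model["counts"][c][tok] + model["alpha"]) / denom)
        scores[c] = score
    return max(scores, key=scores.get)
```

With no labeled documents at all, the model falls back on the lexicon alone; as labeled documents are added, their counts gradually outweigh the pseudo-counts. Under these assumptions, a larger `strength` makes the lexicon count for more training samples.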
Keywords/Search Tags: Prior Knowledge, Machine Learning, Semi-supervised Learning, Active Learning, Sentiment Category