
Research On Active Learning For Sentiment Classification

Posted on: 2014-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: S F Ju
Full Text: PDF
GTID: 2248330398462907
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, more and more people express their opinions and emotions online. Manually analyzing the sentiment of such a large amount of subjective text is extremely time-consuming. Sentiment analysis emerged against this background and has drawn significant attention in the natural language processing community.

Sentiment classification is one of the most widely studied tasks in sentiment analysis. It aims to automatically determine whether a subjective text supports a topic or not, that is, to distinguish positive from negative sentiment. Previous studies find that machine learning performs well in sentiment classification. This approach requires annotating a large number of samples as a training set, which is then used to train a classification model. However, annotation is time-consuming and costly, so minimizing the amount of annotated data while maintaining good classification performance is a meaningful problem. Based on active learning methods for sentiment classification, this thesis carries out the following research:

First, this thesis analyzes the factors that affect active learning methods in sentiment classification: uncertainty, representativeness, diversity, and feature information. Specifically, uncertainty measures how unsure the classifier is about a sample's class; representativeness is measured by pre-clustering; diversity is measured by the distance between the unlabeled samples and the labeled ones; and feature information is measured by the number of features in a sample. The merits and drawbacks of each factor are analyzed in detail on the basis of experiments.
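As a rough illustration of the four factors named above, the following minimal Python sketch scores a single unlabeled sample on each of them. It is not the thesis's implementation: the bag-of-words features, the cosine measure, the function names, and the example posterior values are all assumptions made for illustration, and the representativeness score uses mean similarity to the rest of the pool as a simple density stand-in for the pre-clustering step the abstract mentions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (a deliberately simple feature map)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def uncertainty(p_positive):
    """1.0 when the classifier is undecided (p = 0.5), 0.0 when it is certain."""
    return 1.0 - abs(2.0 * p_positive - 1.0)


def representativeness(sample, pool):
    """Mean similarity to the other pool items (assumed stand-in for pre-clustering)."""
    others = [x for x in pool if x is not sample]
    if not others:
        return 0.0
    return sum(cosine(sample, x) for x in others) / len(others)


def diversity(sample, labeled):
    """Cosine distance to the nearest labeled sample; 1.0 if nothing is labeled yet."""
    if not labeled:
        return 1.0
    return 1.0 - max(cosine(sample, x) for x in labeled)


def feature_information(sample):
    """Number of distinct features present in the sample."""
    return len(sample)


# Hypothetical data: three unlabeled reviews, one labeled review, and
# made-up classifier posteriors P(positive) for the unlabeled ones.
pool = [bag_of_words(t) for t in (
    "great phone great battery",
    "terrible screen",
    "the battery is fine but the screen is terrible",
)]
labeled = [bag_of_words("great phone")]
posteriors = [0.9, 0.2, 0.5]

for sample, p in zip(pool, posteriors):
    print(
        round(uncertainty(p), 3),
        round(representativeness(sample, pool), 3),
        round(diversity(sample, labeled), 3),
        feature_information(sample),
    )
```

How the four scores are combined or compared is exactly what the experiments in the thesis examine; the sketch only shows that each factor can be computed independently for every candidate sample.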
Second, the numbers of positive and negative samples in a corpus are often imbalanced, which seriously degrades the performance of a machine learning classifier. The samples that traditional active learning methods select from an imbalanced corpus remain imbalanced, so these methods do not solve the problem. Active learning on imbalanced corpora therefore has great application value and poses its own challenges. Building on the strengths of traditional active learning methods, this thesis considers not only the amount of information in a text but also the balance of the selected samples. Accordingly, a novel active learning approach, co-selecting, is proposed, which greatly reduces the annotation cost. On this basis, automatic labels for positive samples can be obtained with high accuracy and without much manual intervention. A modified approach, named co-selecting-plus, is therefore put forward to further reduce the annotation cost.

Finally, active learning for sentiment classification that selects both words and documents is studied. Words that carry additional information (usually sentiment words) contribute greatly to the classification results. This thesis records the difference between the cost of annotating words and the cost of annotating documents: because a document is more complex, annotating it takes much more time and labor than annotating a word. A measure for calculating the weight of a word and of a document is proposed. In addition, semi-supervised learning can also reduce annotation cost, and high-quality seed samples can substantially improve the performance of semi-supervised learning methods. Experiments demonstrate that the active learning method that labels both words and documents is more effective than traditional ones. Furthermore, labeling both words and documents is also a good solution for seed selection in semi-supervised sentiment classification.
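The balance concern raised above for imbalanced corpora can be illustrated with a small sketch. This is not the co-selecting or co-selecting-plus algorithm, whose details the abstract does not give; it is a simplified, assumed stand-in that picks the most uncertain samples while alternating between the two predicted classes, so that one class does not dominate the batch sent for annotation. The function name `select_balanced` and the posterior values are hypothetical.

```python
def select_balanced(p_positive, k):
    """Return indices of up to k pool items, alternating between predicted
    classes and taking the most uncertain item of each class first."""
    def margin(i):
        # Distance from the decision boundary; smaller means more uncertain.
        return abs(p_positive[i] - 0.5)

    predicted_pos = sorted(
        (i for i, p in enumerate(p_positive) if p >= 0.5), key=margin)
    predicted_neg = sorted(
        (i for i, p in enumerate(p_positive) if p < 0.5), key=margin)

    chosen = []
    # Take one item from each predicted class in turn until k are chosen
    # or both groups are exhausted.
    while len(chosen) < k and (predicted_pos or predicted_neg):
        for group in (predicted_pos, predicted_neg):
            if group and len(chosen) < k:
                chosen.append(group.pop(0))
    return chosen


# Hypothetical posteriors P(positive) for six unlabeled documents.
posteriors = [0.95, 0.9, 0.55, 0.6, 0.1, 0.45]
print(select_balanced(posteriors, 4))
```

With these made-up posteriors, the batch of four contains two documents predicted positive and two predicted negative. Predicted labels are only a proxy for true labels, which is one reason a real method needs more care than this sketch.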
Keywords/Search Tags: Sentiment Classification, Active Learning, Imbalanced Classification, Semi-supervised Learning, Seed Selection