Research On Key Problems In Text Sentiment Classification And Opinion Summarization

Posted on: 2013-02-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D M Zhang
GTID: 1118330374480718
Subject: Computer application technology
Abstract/Summary:
Natural-language text carries two kinds of information: objective and subjective. Subjective information expresses a person's attitude, standpoint, and opinion toward a specific object. Text sentiment analysis targets this subjective information: it recognizes, classifies, extracts, and annotates expressions of sentiment, opinion, and affect in the content.

With the rapid growth of Internet usage, more and more subjective information appears on social media such as forums, communities, blogs, and shopping websites. Both individuals and organizations have come to rely heavily on online reviews when making decisions. However, because of the huge volume of information available, a person or organization has to search, check, and judge reviews one by one before reaching a decision. In this situation it is very useful to first summarize the large body of relevant information; such a summary is valuable to both customers and manufacturers. This task is called opinion-based multi-document summarization. Customers can also obtain information far more efficiently if the original text is analyzed automatically, for example to determine which reviews express a positive attitude, which express a negative one, and to what extent. This task is called sentiment classification.

This thesis focuses on opinion-based multi-document summarization and sentiment classification, two fields of text sentiment analysis. It consists of the following three parts:

1) Developed a new method for opinion-based multi-document summarization

Current opinion-based multi-document summarization is mainly based on the features or aspects of the reviewed object and is called feature/aspect-based opinion summarization. It depends heavily on accurate recognition of opinion features and opinion words, but in practice the opinion feature or opinion word often does not appear explicitly in the sentence. Feature/aspect-based opinion mining therefore misses opinions that are only implied, which degrades the subsequent summarization. Accurate recognition of features and aspects also requires domain knowledge, which makes the approach domain dependent. Furthermore, because the feature/aspect-based method concentrates on recognizing and evaluating each feature individually, it cannot provide a summary of the main topics and basic ideas that cover all the opinions.

To overcome these problems, this thesis proposes a general, domain-independent multi-document opinion summarization method. The method builds on traditional extractive summarization and combines Latent Dirichlet Allocation (LDA) with semantic orientation for multi-document summarization. First, it models the set of sentences drawn from the documents with LDA to explore the latent topics, obtains the sentence-topic and topic-word distributions through Gibbs sampling, performs part-of-speech analysis, and computes the semantic orientation of words with WordNet and SentiWordNet. Second, it evaluates the importance of topics and then of words, and from these results together with the semantic orientation of the words it evaluates the importance of each sentence. Finally, it sorts the sentences by importance and, after removing redundancy according to the topics, obtains the extractive summary. In this way the LDA model identifies the important topics in the opinion text, and the semantic orientation method identifies the strongly subjective opinions on those topics.
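The scoring pipeline described above (topic importance, then word importance, then sentence importance, then topic-based redundancy removal) can be illustrated with a small sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the function name `rank_sentences`, the averaging used for topic importance, the `1 + |polarity|` weighting, and the one-sentence-per-topic redundancy rule are all hypothetical. The sentence-topic and topic-word distributions are taken as given toy values rather than estimated by Gibbs sampling, and the polarity scores stand in for values that would come from a lexicon such as SentiWordNet.

```python
def rank_sentences(sentences, sent_topic, topic_word, polarity, max_per_topic=1):
    """Return indices of selected sentences, most important first (illustrative only)."""
    n_topics = len(topic_word)
    # Assumed topic importance: mean weight of the topic over all sentences.
    topic_imp = [sum(row[k] for row in sent_topic) / len(sent_topic)
                 for k in range(n_topics)]
    # Assumed word importance: topic-word probability weighted by topic importance.
    word_imp = {}
    for k, dist in enumerate(topic_word):
        for word, prob in dist.items():
            word_imp[word] = word_imp.get(word, 0.0) + topic_imp[k] * prob
    scored = []
    for i, tokens in enumerate(sentences):
        # Assumed sentence importance: word importance boosted by polarity strength.
        score = sum(word_imp.get(w, 0.0) * (1.0 + abs(polarity.get(w, 0.0)))
                    for w in tokens)
        dominant = max(range(n_topics), key=lambda k: sent_topic[i][k])
        scored.append((score, i, dominant))
    scored.sort(key=lambda item: item[0], reverse=True)
    # Assumed redundancy rule: keep at most max_per_topic sentences per dominant topic.
    picked, used = [], {}
    for _, i, k in scored:
        if used.get(k, 0) < max_per_topic:
            picked.append(i)
            used[k] = used.get(k, 0) + 1
    return picked


# Toy inputs (hypothetical): four tokenized review sentences and two topics.
sentences = [["battery", "great"], ["battery", "lasts"],
             ["screen", "terrible"], ["screen", "ok"]]
sent_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
topic_word = [{"battery": 0.6, "great": 0.2, "lasts": 0.2},
              {"screen": 0.6, "terrible": 0.2, "ok": 0.2}]
polarity = {"great": 0.8, "terrible": -0.9, "ok": 0.1}

summary = rank_sentences(sentences, sent_topic, topic_word, polarity)
```

With these toy values the sketch keeps one strongly opinionated sentence per topic, which mirrors the stated goal of pairing important topics with strongly subjective opinions about them.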
Experimental results indicate that the summaries produced by the new method are comparable to those written by experts.

2) Developed a new ensemble learning based method for sentiment classification of unbalanced data

Current work on binary sentiment classification has focused on improving classification performance while neglecting unbalanced data, in which one category has several times as many samples as the other. Most studies of sentiment classification use balanced data, so their methods perform well on balanced data but cannot maintain the same performance in practical applications. It is therefore imperative to develop new methods that handle unbalanced data in sentiment classification and improve its performance in practice.

To this end, this thesis proposes a new sentiment classification method that combines unbalanced-data classification methods with ensemble learning. As a hybrid method, it addresses both the algorithm and the data set. Within the ensemble learning framework, it integrates three techniques for processing the training set: under-sampling, bootstrap re-sampling, and random feature selection. Combining the advantages of the three yields training subsets with greater diversity in both sample space and feature space, which leads to more diverse base classifiers and in turn strengthens the ensemble classifier. Experiments on unbalanced sentiment classification data show that this approach can significantly improve classification performance.

3) Developed a fine-grained sentiment classification method and analyzed the effect of text pre-processing on sentiment classification

Most studies of sentiment classification focus on binary sentiment classification, which categorizes subjective text as positive or negative. In reality, however, text with subjective information cannot always be classified simply as positive or negative. For example, reviews on many shopping websites carry ratings from 1 star to 5 stars. In such cases, classifying reviews only as positive or negative does not meet practical needs. To solve this problem, this thesis proposes a method called fine-grained sentiment classification, which considers not only the positive or negative polarity of the review text but also the strength of its rating. The thesis further analyzes the essential difference between fine-grained sentiment classification and traditional multi-class categorization.

Considering the difference between sentiment classification and traditional topic-based categorization, and to better study fine-grained sentiment classification, this thesis uses supervised machine learning to analyze the factors that affect sentiment classification. Specifically, it compares the performance of combinations of the number of features, stop-word list, feature selection method, feature weighting scheme, and text categorization method. These studies indicate that sentiment classification and topic-based classification respond differently to stop-word lists and feature selection. Finally, to study fine-grained sentiment classification of Chinese text, the thesis carries out an experiment that applies machine learning to reviews in Chinese scientific literature. In this experiment, using the rating attached to each review as its category label removes the need for manual annotation. The experiment shows that fine-grained sentiment classification not only differs from topic-based multi-class categorization but is also harder than both traditional multi-class categorization and binary sentiment classification.
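The ensemble scheme in part 2 (under-sampling, bootstrap re-sampling, and random feature selection feeding diverse base classifiers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not say how the three techniques are ordered or which base classifier is used, so the per-class bootstrap, the nearest-centroid base classifier, the majority vote, and all function names here are hypothetical. Each row of `X` stands for a document's numeric feature vector, with label 1 as the minority class.

```python
import random


def make_subset(y, n_columns, n_features, rng):
    """Pick training rows and feature columns for one base classifier."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Under-sampling: keep as many majority samples as there are minority samples.
    kept = rng.sample(majority, len(minority))
    # Bootstrap re-sampling, done per class so both classes stay represented.
    rows = ([rng.choice(minority) for _ in minority]
            + [rng.choice(kept) for _ in kept])
    # Random feature selection: each base classifier sees a random column subset.
    feats = rng.sample(range(n_columns), n_features)
    return rows, feats


def train_centroids(X, y, rows, feats):
    """Assumed base learner: one centroid per class over the chosen columns."""
    centroids = {}
    for label in (0, 1):
        members = [X[i] for i in rows if y[i] == label]
        centroids[label] = [sum(r[f] for r in members) / len(members) for f in feats]
    return centroids


def fit_ensemble(X, y, n_models=15, n_features=2, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        rows, feats = make_subset(y, len(X[0]), n_features, rng)
        models.append((feats, train_centroids(X, y, rows, feats)))
    return models


def predict(models, x):
    """Majority vote over the base classifiers."""
    votes = 0
    for feats, centroids in models:
        def dist(label):
            return sum((x[f] - c) ** 2 for f, c in zip(feats, centroids[label]))
        votes += min((0, 1), key=dist)
    return 1 if 2 * votes > len(models) else 0


# Toy unbalanced data (hypothetical): 12 majority rows near 0, 3 minority rows near 5.
X = ([[(i % 3) * 0.1, (i % 2) * 0.1, 0.0] for i in range(12)]
     + [[5.0, 5.0 + 0.1 * i, 5.0] for i in range(3)])
y = [0] * 12 + [1] * 3
models = fit_ensemble(X, y)
```

Because every base classifier is trained on a different balanced sample and a different feature subset, the ensemble members differ in both sample space and feature space, which is the source of diversity the abstract describes.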
Keywords/Search Tags: Sentiment Classification, Opinion Summarization, Unbalanced Data Classification, Ensemble Learning, Semantic Orientation