
Research On Domain-Oriented Public Sentiment Analysis Technology

Posted on: 2012-12-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C L Zhang
Full Text: PDF
GTID: 1118330335950229
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technologies, the amount of information on the web is growing exponentially. Meanwhile, the interactive technologies of Web 2.0 enable people to communicate on the Internet and post a variety of opinions and comments. As a result, a great deal of public sentiment information is now available online, but people have great difficulty finding the information they want because it is buried in an ocean of other content. How can public sentiment information about domain-specific events be obtained? The combination of focused crawler technology and sentiment analysis technology makes it possible to solve this problem. Analyzing public sentiment information in a specific domain can support the decisions of policy-making departments, help enterprises improve their plans, and provide users with useful information. To meet these demands, this dissertation proposes key techniques, theories and methods in the following three parts:

1. The dissertation proposes a focused crawler with incremental capability based on a synthetic estimate value. Topics on the web are interwoven, but pages on the same topic follow certain distribution rules. We summarize these rules as Hub, Sibling/Linkage Locality, Site Subject, Tunnel, and Topic Trap, and design focused crawlers based on them. Recent years have seen a great deal of research on focused crawlers, but these studies have limitations: they improve recall at the cost of harvest rate and efficiency, and recall decreases when the harvest rate is kept high. In this dissertation, we propose front-end and back-end classifiers as part of predicting the topic relevance of links. The front-end classifier trains a classification model based on the link context graph, uses a technique that partitions a webpage into visual content blocks, and predicts whether a link on a webpage is topic-relevant based on the link's synthetic value.
This gives the focused crawler the ability to pass through tunnels, i.e., to start from a topic-relevant webpage, pass through some irrelevant webpages, and reach other topic-relevant webpages. The back-end classifier recognizes topic-relevant webpages based on their text content. The experimental results show that our focused crawler can dramatically improve recall, harvest rate and efficiency.

2. A PU-oriented text classifier based on an unsupervised clustering learning algorithm is proposed. Traditional text classification models are based on machine learning and need a large labeled corpus as training data, so a large number of labeled training documents/webpages (often negative training data) are needed to build accurate classifiers. In text classification, labeling is typically done manually by reading the documents/webpages, which is labor-intensive and time-consuming. Collecting negative training examples is especially painstaking and tedious because (1) negative training examples must uniformly represent the universal set excluding the positive class (e.g., a sample of non-homepages should uniformly represent the Internet excluding homepages), and (2) manually collected negative training examples tend to carry unintentional human bias, which can degrade classification performance in terms of accuracy, precision, and recall. The PU-oriented text classifier (learning from positive and unlabeled examples) addresses the machine learning setting in which no labeled negative documents are available in the training set or negative examples are very difficult to collect. Traditional classification algorithms cannot perform well without sufficient positive and negative training data. When a traditional classifier is used for PU-oriented text classification, the key is extracting reliable negative training examples from the unlabeled documents/webpages.
PU-oriented text classification based on machine learning often adopts a two-step approach that makes use of both positive and unlabeled examples. In the first step, a set of reliable negative documents is identified. In the second step, classifiers are constructed iteratively from the training data. In this dissertation, the clustering-based reliable negative example extraction (CBRN) algorithm is proposed, which improves both the number and the accuracy of the extracted reliable negative examples. The existing classification step is also improved by building a set of classifiers through iterative application of the SPY-SVM algorithm. This approach randomly selects s% of the documents from the positive set P as spies and adds them to the unlabeled dataset. These spies help improve the accuracy of identifying negative examples in the unlabeled dataset, and the classifier is trained iteratively until the termination condition is met. Experimental results show that our method outperforms other algorithms in terms of accuracy, recall, precision and F1-measure.

3. Opinion mining, or sentiment analysis, extracts sentiment (or opinion) about a subject from online subjective text documents. In its basic form, it classifies the sentiment of an entire document about a subject. It can provide valuable information for governments, enterprises and users. The dissertation proposes three semantic orientation analysis algorithms (classifying text as positive, negative or neutral) for Chinese text, described below:

1) Polarity classification of public health opinions in Chinese text. With public health events breaking out frequently around the world, people are increasingly expressing their views on these events online. Government agencies need to respond and make policies according to these views. We study Chinese opinion mining in the context of public health. This dissertation proposes two complementary approaches: a sentiment word based approach and a machine learning approach.
The Chinese sentiment word based approach extracts an opinion quadruple from each sentence based on rules. We observe that different types of sentences contribute differently to the overall polarity, and take into account three types of sentences: common sentences, first-person sentences, and topical sentences. We give different weights to these three types when combining sentence scores into the overall polarity score of an entire review through a weighted average. The machine learning based approach extracts unigram and opinion phrase features from labeled training data, selects features by information gain, and trains the sentiment classification model using ten-fold cross-validation. The experimental results show that both methods achieve good performance.

2) This dissertation proposes a string kernel based approach for sentiment classification of Chinese reviews. Machine learning based sentiment classification approaches depend on a feature vector that represents a text. They usually use words or n-grams as features, construct feature vectors according to their presence/absence or frequencies, and use these vectors to build a sentiment classification model. The selection of the feature set is considered the most important point in classifying documents. However, feature extraction not only requires comprehensive expert knowledge but also ignores information such as the position of words and the mutual information between words, so important information may be lost. Word order is extremely important to sentiment analysis. This dissertation therefore proposes sentiment classification of Chinese reviews using machine learning methods based on a string kernel. The features are all possible ordered subsequences of characters, so the sentiment classification model can be constructed without losing this important information.
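To make the idea of ordered character subsequences as features concrete, the following is a minimal brute-force sketch of a gap-weighted subsequence kernel. It is an illustration only, not the dissertation's implementation: the function names, the subsequence length `n`, and the decay factor `lam` are assumptions introduced here, and the enumeration is only practical for short strings.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def subsequence_features(text, n, lam):
    """Map each ordered length-n character subsequence to its weight.

    Every occurrence contributes lam ** span, where span is the distance
    covered in the text, so occurrences with gaps count less.
    (n and lam are illustrative parameters, not values from the source.)
    """
    features = defaultdict(float)
    for idx in combinations(range(len(text)), n):
        span = idx[-1] - idx[0] + 1
        features["".join(text[i] for i in idx)] += lam ** span
    return features


def subsequence_kernel(s, t, n=2, lam=0.5):
    """Inner product of the two subsequence feature maps."""
    fs = subsequence_features(s, n, lam)
    ft = subsequence_features(t, n, lam)
    return sum(weight * ft[u] for u, weight in fs.items() if u in ft)


def normalized_kernel(s, t, n=2, lam=0.5):
    """Kernel scaled so that a string compared with itself scores 1."""
    denom = sqrt(subsequence_kernel(s, s, n, lam) * subsequence_kernel(t, t, n, lam))
    return subsequence_kernel(s, t, n, lam) / denom if denom else 0.0
```

Because the subsequences are ordered, `subsequence_kernel("ab", "ba")` is zero even though both strings contain the same characters, which is the word-order sensitivity that a bag-of-words feature vector lacks. A kernel of this kind could be supplied to a kernel-based classifier such as an SVM as a precomputed similarity matrix.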
We also conduct experiments to demonstrate the effectiveness of this approach.

3) Sentiment analysis of Chinese documents from the sentence level to the document level. This dissertation proposes a rule-based approach with two phases: first, determining each sentence's sentiment based on word dependencies and contextual modifiers; second, aggregating the sentence polarity scores to predict the document sentiment. We assign sentences different weights to adjust their contribution to the overall polarity based on five features: the position of the sentence, the tf-isf term weight of the sentence, the similarity between the sentence and the headline, the occurrence of keywords in the sentence, and the first-person mode. We report experimental results comparing our approach with three machine learning based approaches on two datasets of Chinese articles. Our approach achieves performance similar to that of SVM. Moreover, our rule-based approach is much more portable and adaptable to various topic domains, since it does not require the manual annotation of large amounts of training data. These results illustrate the effectiveness of the proposed method and its advantages over learning-based approaches.
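The second phase of the rule-based approach, aggregating weighted sentence polarity scores into a document label, can be sketched roughly as follows. This is a hedged illustration: the abstract does not say how the five features are combined into a sentence weight, so the additive weight, the feature names, the polarity range of -1 to 1, and the neutral threshold are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    polarity: float  # assumed to lie in [-1, 1]; negative means negative sentiment
    # Hypothetical non-negative feature scores, e.g. "position",
    # "tf_isf", "headline_similarity", "keywords", "first_person".
    features: dict = field(default_factory=dict)


def sentence_weight(sentence):
    """Assumed weighting: a base weight of 1 plus the sum of feature scores."""
    return 1.0 + sum(sentence.features.values())


def document_polarity(sentences, threshold=0.1):
    """Weighted average of sentence polarities, mapped to a three-way label."""
    if not sentences:
        return 0.0, "neutral"
    total = sum(sentence_weight(s) for s in sentences)
    score = sum(sentence_weight(s) * s.polarity for s in sentences) / total
    if score > threshold:
        return score, "positive"
    if score < -threshold:
        return score, "negative"
    return score, "neutral"
```

For example, a positive sentence that closely matches the headline outweighs a mildly negative sentence with no supporting features, so `document_polarity([Sentence(0.8, {"headline_similarity": 1.0}), Sentence(-0.5)])` is labeled positive with a higher score than a plain unweighted average would give.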
Keywords/Search Tags: Focus crawling, PU text classification, opinion mining, sentiment classification, string kernel