
Research On Key Problems In WEB Text Sentiment Classification

Posted on: 2009-04-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Chen
Full Text: PDF
GTID: 1118360245969618
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of computer technology and the Internet, online documents have become one of the major modern information media and an indispensable information source in people's lives. With the arrival of Web 2.0, people increasingly take the initiative in acquiring, publishing, sharing and disseminating information. Because users are now involved in generating information, the Internet is filling with personal, opinionated content. Such content is valuable for many applications, such as e-commerce, online communities, network information security and web search engines. Automatic sentiment analysis of opinionated content on the web has recently become a hotspot in web information processing research, and its core technology is text sentiment classification.

In this dissertation, three problems are investigated: Chinese word segmentation, text sentiment classification, and Weblog opinion retrieval. The main contributions of this dissertation are summarized as follows.

Firstly, we designed a multi-model hybrid Chinese word segmentation system for web text. Given the characteristics of web text, we focused on the Out-of-Vocabulary (OOV) word identification problem in the Chinese word segmentation task, as well as on disambiguation and the ability to handle huge amounts of text. To deal with OOV words, we proposed POC-NLW character tagging templates to represent the word-internal composition of Chinese words at the character level. These templates were incorporated into a Hidden Markov Model so that Chinese word segmentation is carried out as a character sequence tagging procedure. In addition, we implemented rule-based text preprocessing, dictionary-based complete segmentation, and a word-level N-gram language model, and combined them sequentially into a multi-model hybrid hierarchical system.
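To make the idea of segmentation as character sequence tagging concrete, here is a minimal sketch. It does not reproduce the POC-NLW templates, whose definition is not given in this abstract. Instead it assumes the generic four-tag position scheme (B = begin, M = middle, E = end, S = single-character word), and the function names `words_to_tags` and `tags_to_words` are hypothetical. In a real system, a sequence model such as an HMM would predict the tags; this sketch only shows the mapping between words and tags.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one position tag per character.

    Assumed tag set (not the dissertation's POC-NLW templates):
    B = word start, M = word middle, E = word end, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from a character string and its position tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = ["我", "喜欢", "计算机"]
    tags = words_to_tags(segmented)
    print(tags)
    print(tags_to_words("".join(segmented), tags))
```

Because every character receives a tag regardless of whether its word is in a dictionary, a tagger trained on such sequences can in principle propose boundaries for unseen words, which is why character tagging is commonly used for OOV identification.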
Experimental results show that the POC-NLW template-based tagging method performs well on the OOV word identification task, and that the hybrid system achieves high scores on both overall segmentation precision and OOV word recall. In addition, the proposed method handles huge amounts of web text efficiently and effectively.

Secondly, we discussed the web text sentiment classification problem, including text subjectivity classification and text polarity classification. We investigated several N-gram templates to represent text features. Four feature weighting methods were used, namely Boolean, absolute word frequency, normalized word frequency and TFIDF. We proposed a global TFIDF significance indicator as well as a "global-filtering and local-weighting" strategy for feature selection. In constructing the sentiment classification model, we compared the Maximum Entropy (Maxent) model against the Naïve Bayes model. We focused on the Maximum Likelihood estimation problem in Maxent modeling, and introduced two kinds of priors, Gaussian and exponential, to improve the basic Maxent models. Based on detailed experiments and analysis on a movie review corpus, we confirmed that, with high-order language features and the TFIDF-based feature selection method, the Maxent models with exponential priors perform best in the text sentiment classification task.

Finally, following the TREC Blog Track evaluation schedule, we discussed the Weblog opinion retrieval problem, which includes the Weblog topic retrieval and opinion retrieval sub-tasks. First, we examined the Weblog documents and designed specialized preprocessors (such as an HTML parser, noise tag filter, text extractor and stemmer) to parse the original blog dataset.
Then, based on the Indri retrieval system, we reformulated the query topics using Indri's structured query language, expanded these topics using a web search engine, and also tried the field-based retrieval function supported by Indri. These methods greatly improved topic retrieval performance. On top of this well-designed basic topic retrieval system, we added the Maxent-based sentence-level text sentiment classification technique to our opinion retrieval system. To obtain a suitable opinion detector for the blog dataset, we proposed a self-learning strategy for transferring knowledge between different corpora. In addition, we proposed a document-level Maxent opinion detector, which is combined with the sentence-level Maxent model to form a hierarchical system for detecting opinionated content in Weblog documents. Experiments on the Blog Track dataset show that the system proposed in this dissertation is a state-of-the-art Weblog opinion retrieval system.
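As a rough illustration of the four feature weighting schemes named in the abstract (Boolean, absolute frequency, normalized frequency and TFIDF) applied to N-gram features, the following sketch computes all four for a small tokenized corpus. The abstract does not give the exact formulas, so the TFIDF variant here (raw count times log(N/df)) is an assumption, and the helper names `ngrams` and `weight_documents` are hypothetical. The dissertation's global TFIDF indicator and its "global-filtering and local-weighting" selection strategy are not reproduced.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return the contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weight_documents(docs: List[List[str]], n: int = 1) -> Dict[str, List[Dict[str, float]]]:
    """Compute four per-document feature weightings over n-gram features.

    Assumed TFIDF form: count * log(num_docs / document_frequency).
    """
    counts = [Counter(ngrams(doc, n)) for doc in docs]

    # Document frequency: in how many documents each feature occurs.
    doc_freq: Counter = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    num_docs = len(docs)
    weights: Dict[str, List[Dict[str, float]]] = {
        "bool": [], "tf": [], "norm_tf": [], "tfidf": []
    }
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for empty docs
        weights["bool"].append({f: 1.0 for f in c})
        weights["tf"].append({f: float(v) for f, v in c.items()})
        weights["norm_tf"].append({f: v / total for f, v in c.items()})
        weights["tfidf"].append(
            {f: v * math.log(num_docs / doc_freq[f]) for f, v in c.items()}
        )
    return weights


if __name__ == "__main__":
    corpus = [["good", "movie"], ["bad", "movie"]]
    for scheme, vectors in weight_documents(corpus).items():
        print(scheme, vectors)
```

Under this TFIDF variant, a feature that occurs in every document receives weight zero, so it contributes nothing to separating classes. Higher-order features can be tried by passing `n=2` or `n=3`.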
Keywords/Search Tags: Chinese word segmentation, POC-NLW template, sentiment classification, maximum entropy, opinion retrieval