
Research On Key Techniques Of Opinion Mining For Chinese Web Reviews

Posted on: 2014-02-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Li
Full Text: PDF
GTID: 1228330398490216
Subject: Education Technology
Abstract/Summary:
The Web stores a huge number of comments about social events and public figures, as well as products. These comments are valuable to governments, manufacturers and consumers. However, the amount of web information is growing exponentially, and users face the problem of information overload. Collecting information manually is extremely time-consuming and laborious. There is an urgent need for effective means of collating, analyzing and extracting data that can give users valuable, clear and comprehensive information. Opinion mining has emerged to meet this need and has attracted increasing attention; it has become an active research area in data mining and natural language processing.

This dissertation focuses on aspect-based opinion mining of Chinese product reviews, including aspect and relation extraction and sentimental word identification. First, it extracts aspects and their hierarchical relations from a review corpus with a topic model. Then, it divides sentimental words into context-free and context-dependent words, and identifies them with a method based on word explanations and a method based on association rules, respectively. Finally, it aggregates the results by aspect and displays them as a hierarchy. The research work and contributions are as follows.

(1) This dissertation proposes a review-topic model (RTM) for extracting aspects and their hierarchical relations. The model extends Latent Dirichlet Allocation (LDA) by adding a review indicator layer between the document layer and the topic layer. Each document is represented as a mixture of review indicators, each indicator is associated with a multinomial distribution over topics, and each topic is associated with a multinomial distribution over words. The basic idea of RTM is that review indicators can effectively guide the generation of the words in a document.
After parameter estimation, related aspects are assigned to the same topic, and the hierarchy of indicators, topics and aspects can be obtained from the indicator-topic and topic-word distributions. Experimental results show that the precision, recall and F-measure of aspect extraction are 8.6%, 3% and 7% higher, respectively, than those of the LDA model. In addition, the RTM model can capture the hierarchical relations between aspects.

(2) Adding prior knowledge about word distributions to a topic model can improve its performance. This dissertation studies how to incorporate prior knowledge into the RTM model and presents a review-topic model with a Dirichlet Forest prior (RTM-DF). The RTM-DF model extends the RTM model by replacing the Dirichlet prior over the topic-word multinomial with a Dirichlet Forest prior, which can encode word correlation. First, the correlation between words is calculated to generate a set of Must-Links and Cannot-Links. Then, the structures of the Dirichlet trees are obtained by encoding the Must-Link and Cannot-Link constraints. Words under the same subtree are expected to be more strongly correlated than words under different subtrees. Finally, each topic is assigned a tree by the Dirichlet Forest distribution, and the topic-word multinomial is sampled conditioned on these trees. After parameter estimation, the indicator-topic and topic-word distributions are used for aspect and hierarchical relation extraction. Experiments are conducted on a synthetic dataset and a review dataset. Both sets of results show that the RTM-DF model performs much better than the RTM model; it improves the precision and F-measure of aspect extraction by 5% and 3.7%, respectively.

(3) This dissertation proposes a noun phrase extraction method based on rules and co-occurrence probability. After word segmentation and part-of-speech tagging, word combination rules are used to extract candidate noun phrases.
Then, the co-occurrence probabilities between words are used to filter out noisy phrases.

(4) This dissertation proposes a method for identifying context-free sentimental words based on word explanations. First, a candidate sentimental vocabulary is built from existing emotional resources. Then, for each word in the vocabulary, all of its explanations in the Modern Chinese Dictionary are extracted, and a multi-feature fusion method is used to calculate the orientation of the explanations. The results are then used to identify the context-free sentimental words through a multi-cycle strategy. As a result, a context-free sentimental dictionary is built that is applicable to any domain.

(5) This dissertation proposes a method for mining context-dependent sentimental collocative phrases based on association rules. Common collocative phrases are identified from the corpus using association rule mining. Then, their orientations are calculated according to the context. As a result, a collection of context-dependent sentimental collocative phrases is built. The experiments are conducted on the COAE2011 test collections. The results show that sentimental word identification is significantly improved.
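
As a rough illustration of the generative story in contribution (1), the sketch below draws a document by sampling a review indicator, then a topic, then a word for each token. It is only a toy under stated assumptions, not the dissertation's implementation: the function names, the symmetric Dirichlet hyperparameters (alpha, beta, gamma) and the toy vocabulary are all hypothetical, and no parameter estimation is shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_rtm_parameters(n_indicators, n_topics, vocab_size, rng,
                          beta=0.5, gamma=0.5):
    """Sample corpus-level distributions (beta and gamma are assumed values)."""
    # One multinomial over topics per review indicator.
    indicator_topic = [sample_dirichlet([beta] * n_topics, rng)
                       for _ in range(n_indicators)]
    # One multinomial over words per topic.
    topic_word = [sample_dirichlet([gamma] * vocab_size, rng)
                  for _ in range(n_topics)]
    return indicator_topic, topic_word


def generate_document(n_words, vocab, params, rng, alpha=0.5):
    """Generate (indicator, topic, word) triples for one document."""
    indicator_topic, topic_word = params
    # Each document is a mixture over review indicators (alpha is assumed).
    doc_indicator = sample_dirichlet([alpha] * len(indicator_topic), rng)
    tokens = []
    for _ in range(n_words):
        r = sample_index(doc_indicator, rng)       # review indicator
        z = sample_index(indicator_topic[r], rng)  # topic given indicator
        w = sample_index(topic_word[z], rng)       # word given topic
        tokens.append((r, z, vocab[w]))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["screen", "battery", "price", "service"]  # toy vocabulary
    params = sample_rtm_parameters(2, 3, len(vocab), rng)
    for triple in generate_document(5, vocab, params, rng):
        print(triple)
```

The only structural difference from plain LDA in this sketch is the extra sampling step for the indicator, which is what lets the indicator-topic and topic-word distributions be read as a two-level hierarchy after estimation.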
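
The association-rule step in contribution (5) can be pictured with a small sketch that scores aspect-to-opinion-word rules by support and confidence over segmented sentences. This is an assumption-laden toy rather than the author's method: the function name, the thresholds and the English tokens standing in for segmented Chinese words are hypothetical, and the orientation calculation described in the abstract is not attempted here.

```python
from collections import Counter
from itertools import product


def mine_collocations(sentences, aspects, opinion_words,
                      min_support=0.2, min_confidence=0.5):
    """Return {(aspect, opinion): (support, confidence)} for frequent pairs.

    sentences: list of token lists (already segmented).
    aspects, opinion_words: sets of known aspect and opinion terms.
    The thresholds are illustrative values, not taken from the source.
    """
    n = len(sentences)
    aspect_count = Counter()
    pair_count = Counter()
    for tokens in sentences:
        present = set(tokens)
        found_aspects = present & aspects
        found_opinions = present & opinion_words
        aspect_count.update(found_aspects)
        # Count every aspect/opinion pair co-occurring in the sentence.
        pair_count.update(product(found_aspects, found_opinions))

    rules = {}
    for (aspect, opinion), count in pair_count.items():
        support = count / n                        # P(aspect and opinion)
        confidence = count / aspect_count[aspect]  # P(opinion | aspect)
        if support >= min_support and confidence >= min_confidence:
            rules[(aspect, opinion)] = (support, confidence)
    return rules


if __name__ == "__main__":
    toy_sentences = [
        ["price", "high"],
        ["price", "high"],
        ["quality", "high"],
        ["price", "low"],
        ["screen", "clear"],
    ]
    rules = mine_collocations(
        toy_sentences,
        aspects={"price", "quality", "screen"},
        opinion_words={"high", "low", "clear"},
    )
    for pair, (support, confidence) in sorted(rules.items()):
        print(pair, round(support, 2), round(confidence, 2))
```

In the toy data, "high" pairs with both "price" and "quality", which is the kind of word whose orientation depends on the aspect it modifies and therefore has to be judged as a collocation rather than on its own.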
Keywords/Search Tags: Opinion Mining, Aspect, Sentimental Words, Topic Model, Dirichlet Forest, Association Rules