
Studies On Clustering Blog Text Based On Certain Topic And Sentiment Polarity

Posted on: 2011-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: J Pang
Full Text: PDF
GTID: 2178360305981766
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, the demand for information keeps growing; at the same time, the enormous volume of information online makes it hard for people to find what they need. Blog text (blogs for short) is a typical form of online information. Blogs contain a large number of reviews that express bloggers' sentiments and attitudes toward people, things, events and so on (collectively referred to as opinions). These sentiments and attitudes carry a lot of valuable information. Understanding these opinions, sentiment polarities or attitudes can help people make effective choices: for example, telling consumers which product to buy, helping companies formulate market strategies, and helping governments track online public opinion. Analyzing and mining the opinions that bloggers embed in blogs has become one of the hotspots in data mining research. Opinion Mining is a technology for mining opinions from the content of forums and discussion groups. Generally, Opinion Mining has four subtasks: (1) Topic Extraction, (2) Holder Identification, (3) Claim Selection and (4) Sentiment Analysis. In this field, scholars outside China started earlier and focus on English text, while scholars in China started later and much foundational work remains to be done. At present, most of the literature divides sentiment polarity (people's attitudes toward objective things, such as like/dislike or praise/criticism) into two categories (positive and negative) or three categories (positive, neutral and negative). However, human sentiments are rich, and two or three categories are not enough to express the sentiments that bloggers embed in blog text.
Clustering blog text by author, date and topic already has precedents; clustering Chinese blog text by sentiment polarity, however, has rarely been reported. This thesis uses clustering to group Chinese blog text by the sentiment polarity of the bloggers, with the aim of subdividing sentiment polarity more finely. The study finds that although full blog text is rich in sentiment, that sentiment may be scattered; by contrast, the sentiment polarity contained in blog search results (the titles and the snippets) is relatively concentrated. The study therefore uses blog search results (the titles and the snippets) as a concise representation of the blogs and as the objects of study. First, the thesis designs a "crawler" to collect the results returned by Google blog search for a given topic (the topics "The Founding of a Republic" and "Xiang Liu" are used in the experiments). Next, the data set is manually annotated into three categories by sentiment polarity (positive, neutral and negative). The blog search results are then segmented with the Chinese Academy of Sciences ICTCLAS Chinese word segmentation tool, and a lexicon-based method extracts the sentiment words from the resulting word sequences (the thesis adopts two Chinese sentiment lexicons: HowNet and NTUSD). The texts in the data set are then represented separately with "the standard graph-based document representation model" proposed by Adam Schenker, Horst Bunke, et al. (GBR model for short) and with the integrated graph-based document representation model designed by the author of this thesis (SoB-graph model for short). On that basis, the k-medoids algorithm built on the graph-based document representation model of Adam Schenker, Horst Bunke, et al. is applied to cluster Chinese blogs by embedded sentiment.
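The pipeline described above (lexicon filtering, graph representation, k-medoids) can be sketched roughly as follows. This is a minimal illustration rather than the thesis's implementation. It assumes that the standard representation uses one node per unique term with a directed edge between adjacent terms, and that graph distance is based on the maximum common subgraph, which reduces to a set intersection when node labels are unique. The toy lexicon, the English tokens and all function names are hypothetical. The SoB-graph model is not specified in this abstract and is not reproduced here.

```python
import random

def extract_sentiment_words(tokens, lexicon):
    """Keep only tokens found in a sentiment lexicon (toy stand-in for HowNet/NTUSD)."""
    return [t for t in tokens if t in lexicon]

def build_graph(tokens):
    """Assumed standard-style graph: unique terms as nodes, adjacent pairs as directed edges."""
    nodes = set(tokens)
    edges = {(a, b) for a, b in zip(tokens, tokens[1:]) if a != b}
    return nodes, edges

def graph_size(graph):
    nodes, edges = graph
    return len(nodes) + len(edges)

def mcs_distance(g1, g2):
    """1 - |mcs(g1, g2)| / max(|g1|, |g2|); with unique labels the mcs is the intersection."""
    denom = max(graph_size(g1), graph_size(g2))
    if denom == 0:
        return 0.0  # two empty graphs are treated as identical
    common = (g1[0] & g2[0], g1[1] & g2[1])
    return 1.0 - graph_size(common) / denom

def k_medoids(graphs, k, max_iter=100, seed=0):
    """Plain k-medoids over a precomputed graph-distance matrix."""
    rng = random.Random(seed)
    n = len(graphs)
    dist = [[mcs_distance(graphs[i], graphs[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)

    def assign():
        # Each document joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign()
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the old medoid for an empty cluster
                continue
            # The new medoid minimises total distance to the other members.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign()

# Hypothetical toy data: already-segmented snippets and a tiny lexicon.
LEXICON = {"good", "great", "love", "like", "bad", "awful", "hate", "boring"}
SNIPPETS = [
    ["the", "film", "good", "great", "love"],
    ["good", "great", "really", "like"],
    ["bad", "awful", "plot", "hate"],
    ["bad", "awful", "so", "boring"],
]
GRAPHS = [build_graph(extract_sentiment_words(s, LEXICON)) for s in SNIPPETS]
MEDOIDS, LABELS = k_medoids(GRAPHS, k=2)
```

Because the medoid of each cluster is itself a document, its sentiment words can serve directly as a readable label for that cluster.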
Lastly, the centroid representation method is used to present the sentiment clusters (the sentiment words of the document corresponding to the centroid represent the cluster), and three common metrics of the Ground Truth method (precision, entropy and the Rand index) are used to evaluate the clustering results. The experimental results show that clustering Chinese blogs by embedded sentiment performs better with the SoB-graph model than with the GBR model.
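The three evaluation metrics can be sketched as below, comparing cluster assignments against manually annotated labels. The abstract does not define "precision" for clusters, so this sketch assumes the common purity-style reading (the share of documents that belong to the majority class of their cluster); the function names and sample labels are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def rand_index(truth, pred):
    """Fraction of document pairs on which the clustering and the annotation agree."""
    pairs = list(combinations(range(len(truth)), 2))
    if not pairs:
        return 1.0
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)

def _clusters(truth, pred):
    """Group ground-truth labels by predicted cluster id."""
    groups = {}
    for t, p in zip(truth, pred):
        groups.setdefault(p, []).append(t)
    return groups

def cluster_entropy(truth, pred):
    """Size-weighted average entropy of the class distribution inside each cluster (lower is better)."""
    n = len(truth)
    total = 0.0
    for members in _clusters(truth, pred).values():
        size = len(members)
        h = -sum((m / size) * log2(m / size) for m in Counter(members).values())
        total += (size / n) * h
    return total

def purity_precision(truth, pred):
    """Assumed reading of 'precision': share of documents in their cluster's majority class."""
    n = len(truth)
    hits = sum(max(Counter(m).values()) for m in _clusters(truth, pred).values())
    return hits / n

# Hypothetical annotation and clustering output for six snippets.
TRUTH = ["pos", "pos", "neg", "neg", "neu", "neu"]
PRED = [0, 0, 1, 1, 1, 2]
SCORES = (rand_index(TRUTH, PRED), cluster_entropy(TRUTH, PRED), purity_precision(TRUTH, PRED))
```

A perfect clustering gives a Rand index of 1, an entropy of 0 and a purity of 1, so the three numbers together show both how well documents are paired and how mixed each cluster is.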
Keywords/Search Tags:network public opinion, blog, opinion mining, sentiment polarity, clustering