Studies On Clustering Blog Text Based On Certain Topic And Sentiment Polarity

Posted on:2011-06-18

Degree:Master

Type:Thesis

Country:China

Candidate:J Pang

Full Text:PDF

GTID:2178360305981766

Subject:Computer Science and Technology

Abstract/Summary:

With the development of the Internet, demand for information keeps growing; at the same time, the enormous volume of information online makes it hard for people to find what they need. Blog text (blogs for short) is a typical form of online information. Blogs contain a large number of reviews and express bloggers' sentiments and attitudes toward people, things, events and so on (collectively referred to as opinions). These sentiments and attitudes carry a great deal of valuable information. Understanding these "opinions", "sentiment polarities" or "attitudes" can help people gain more valuable information and make effective choices, for example by telling people which commodity to purchase, helping companies formulate market strategies, and helping governments keep track of online public opinion. At present, analyzing and mining the opinions that bloggers embed in blogs has become one of the hotspots in data mining research. Opinion Mining is a technology for mining opinions from the content of forums and discussion groups. Generally, Opinion Mining has four subtasks: (1) Topic Extraction, (2) Holder Identification, (3) Claim Selection and (4) Sentiment Analysis. In Opinion Mining research, foreign scholars started earlier and focus on English text, while domestic scholars started later and have left much foundational work still to be done. At present, most of the literature divides sentiment polarity (people's attitudes toward objective things, such as like/dislike or praise/disparage) into two categories (positive and negative) or three categories (positive, neutral and negative). However, human sentiments are rich, and two or three categories alone are not enough to express the sentiments that bloggers embed in blog text.
Clustering blog text by author, date and topic already has precedents; clustering Chinese blog text by sentiment polarity, however, has rarely been reported. This thesis uses clustering to group Chinese blog text by the bloggers' sentiment polarity, with the aim of subdividing sentiment polarity more finely. The study finds that although blog text contains rich sentiment, that sentiment may be scattered; by contrast, the sentiment polarity contained in blog search results (the titles and the snippets) is relatively concentrated. This study therefore uses blog search results (the titles and the snippets) as a concise representation of the blogs and as the objects of study. First, the thesis designs a "crawler" to collect the results returned by Google Blog Search for a given topic (the topics "The Founding of a Republic" and "Xiang Liu" are used in the experiments). Then the data set is manually annotated into three categories by sentiment polarity (positive, neutral and negative). After that, the Chinese Academy of Sciences' ICTCLAS Chinese word segmentation tool is applied to the blog search results, and a lexicon-based method is used to extract the sentiment words from the resulting word strings (the thesis uses two Chinese sentiment lexicons: HowNet and NTUSD). The text in the data set is then represented separately with "the standard graph-based document representation model" proposed by Adam Schenker, Horst Bunke and others (GBR model for short) and with the integrated graph-based document representation model designed by the author of this thesis (SoB-graph model for short). On that basis, the k-medoids algorithm for graph-based document representations proposed by Adam Schenker, Horst Bunke and others is applied to cluster the Chinese blogs by embedded sentiment.
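The clustering step can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes each search result has already been segmented and reduced to a list of sentiment words, builds a simplified word-adjacency graph (unique terms as nodes, a directed edge from each term to the term that follows it), and uses a maximum-common-subgraph style distance as one plausible choice. The function names, the distance definition and the English toy tokens are all assumptions made for illustration; the SoB-graph model is not reproduced here.

```python
import random

def build_graph(tokens):
    """Simplified adjacency graph (assumption): unique terms are nodes,
    and a directed edge links each term to the term that follows it."""
    nodes = set(tokens)
    edges = set(zip(tokens, tokens[1:]))
    return nodes, edges

def graph_distance(g1, g2):
    """1 - |common subgraph| / |larger graph|, where size = nodes + edges.
    With unique node labels the common subgraph is a set intersection."""
    (n1, e1), (n2, e2) = g1, g2
    common = len(n1 & n2) + len(e1 & e2)
    largest = max(len(n1) + len(e1), len(n2) + len(e2))
    return 1.0 - common / largest if largest else 0.0

def assign(graphs, medoids):
    """Label each graph with the index of its nearest medoid."""
    return [
        min(range(len(medoids)),
            key=lambda c: graph_distance(g, graphs[medoids[c]]))
        for g in graphs
    ]

def k_medoids(graphs, k, iterations=20, seed=0):
    """Plain k-medoids: alternate assignment and medoid update."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(graphs)), k)
    for _ in range(iterations):
        labels = assign(graphs, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                updated.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # The new medoid minimises the total distance to its cluster.
            updated.append(min(
                members,
                key=lambda i: sum(graph_distance(graphs[i], graphs[j])
                                  for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(graphs, medoids), medoids

# Toy, already-segmented sentiment words (hypothetical data).
docs = [
    ["good", "great", "love"],
    ["good", "great", "like"],
    ["bad", "awful", "hate"],
    ["bad", "awful", "boring"],
]
graphs = [build_graph(d) for d in docs]
labels, medoids = k_medoids(graphs, k=2)
```

Because k-medoids only needs pairwise distances, the same loop works for any graph model for which a distance can be defined, which is why the two representations can be compared under one algorithm.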
Lastly, the centroid representation method is used to present the sentiment clusters (the sentiment words of the document corresponding to the centroid are used to represent the cluster), and three common metrics of the Ground Truth method (precision, entropy and the Rand index) are used to evaluate the clustering results. The experimental results show that clustering Chinese blogs by embedded sentiment performs better with the SoB-graph model than with the GBR model.
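The three evaluation metrics can be sketched as follows. The abstract does not give their exact formulas, so this sketch assumes common textbook definitions: precision as the share of documents that belong to the majority class of their cluster, entropy as the size-weighted class entropy of the clusters, and the Rand index as the share of document pairs on which the clustering and the manual labels agree. The function names and the sample labels are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def members_of(truth, pred, cluster):
    """Manual labels of the documents assigned to one cluster."""
    return [t for t, p in zip(truth, pred) if p == cluster]

def cluster_precision(truth, pred):
    """Share of documents in the majority class of their cluster
    (assumed definition; higher is better)."""
    hits = 0
    for c in set(pred):
        hits += Counter(members_of(truth, pred, c)).most_common(1)[0][1]
    return hits / len(truth)

def cluster_entropy(truth, pred):
    """Size-weighted class entropy of the clusters (lower is better)."""
    n = len(truth)
    total = 0.0
    for c in set(pred):
        members = members_of(truth, pred, c)
        size = len(members)
        h = -sum((k / size) * log2(k / size)
                 for k in Counter(members).values())
        total += size / n * h
    return total

def rand_index(truth, pred):
    """Share of document pairs treated the same way (together or apart)
    by the clustering and by the manual labels."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j])
                for i, j in pairs)
    return agree / len(pairs)

# Hypothetical manual labels and cluster assignments for six documents.
truth = ["pos", "pos", "neg", "neg", "neu", "neu"]
pred = [0, 0, 1, 1, 1, 2]
scores = (cluster_precision(truth, pred),
          cluster_entropy(truth, pred),
          rand_index(truth, pred))
```

Precision and the Rand index rise as clusters match the manual annotation more closely, while entropy falls, so two representation models can be compared by running the same clustering on each and comparing the three scores.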

Keywords/Search Tags:

network public opinion, blog, opinion mining, sentiment polarity, clustering

Related items

1	A Study On The Method Of Generating Chinese Opinion Abstracts For Product Reviews
2	Research On Opinion Mining And Sentiment Analysis For Chinese Micro-blog
3	Web Public Opinion Orientation Analysis Based On Technique Of Sentiment Lexicon Extension
4	Analysis Of Micro-blog Public Opinion In Sentiment Context
5	Research On Opinion Mining Based On Product Comments
6	The Transition Of Public Opinion Formation Characteristics And Public Opinion Guide Strategy Research
7	The Research Of Micro-blog Public Opinion Based On Sentiment Analysis
8	Microblogging Network Public Opinion Analysis And Opinion Leaders In Mining
9	Take The "AIDS-girl" Event As An Example To Study The Public Opinion On Blog And The Strategy Of Guiding The Public Opinion On Blog
10	Analysis On The Evolution Of Mass Unexpected Incident Micro-blog Public Opinion