
Study And Implementation On Data Cleaning And Sentiment Analysis Techniques For Chinese Microblog

Posted on: 2014-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: L Wang
Full Text: PDF
GTID: 2268330425491857
Subject: Computer technology
Abstract/Summary:
As a new information carrier, the microblog has become popular and indispensable in people's daily lives. Microblogs contain a large number of valuable comments on celebrities, events and products, which express users' sentiment orientation and play an increasingly important role in the emergence and propagation of Web public opinion. Addressing the new characteristics of Chinese microblogs, this thesis studies data cleaning, sentiment orientation analysis, and their related techniques.

Firstly, to address the problem of spam and near-duplicate microblogs, the thesis studies a microblog data cleaning approach. Recently, large numbers of spam and near-duplicate microblogs have spread throughout the microblog space. They reduce the accuracy of information retrieval and undermine the credibility of further analysis, so eliminating them has become a serious problem in the related research area. To tackle this problem, the characteristics of spam and near-duplicate microblogs are analyzed based on statistical analysis of massive real-world microblog data, and a filtering approach with feature selection and double content similarity detection for microblog text streams is proposed. The proposed method first filters out spam microblogs using URL links, character rates and high-frequency words. The near-duplicate microblogs are then eliminated through subsection-based and index-based filters. Experiments show that the proposed method can effectively purify the microblog data by filtering out spam and near-duplicate microblogs.

Secondly, based on the characteristic of "straightforward emotion expression", this thesis studies a sentiment orientation analysis method for Chinese microblogs. "Straightforward emotion expression" means that people writing microblog posts tend to use emoticons, interjections, and degree adverbs to express themselves.
The existing sentiment lexicons and previous sentiment analysis methods can be applied to the sentiment analysis task for Chinese microblogs, but these methods usually ignore the new characteristics of microblog content. There is a lack of studies on building sentiment lexicons, modifier lexicons and related processing methods for Chinese microblogs. Therefore, this thesis analyzes the existing emotional lexicon and builds a new sentiment lexicon as well as an auxiliary lexicon of emoticons, degree adverbs and interjections for Chinese microblogs. In addition, a novel sentiment analysis algorithm based on the new lexicons is proposed. Experimental results show that the constructed lexicon with sentiment weights contains the most common emotional words, and that the proposed orientation classification algorithm can achieve higher precision, recall and F-score.

Finally, based on the above research, we develop a prototype system for sentiment orientation analysis of Chinese microblogs. The system implements components for data crawling, data parsing, data cleaning, sentiment analysis and result visualization. Through crawling and analysis, the prototype system helps users understand the emergence and propagation of public opinion in the Chinese microblogging space.
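To make the cleaning and sentiment-analysis components more concrete, the following is a minimal Python sketch, not the thesis's actual algorithm. It assumes a spam check based on URL count, share of Chinese characters (one possible reading of "character rates") and hits against a spam-word list, followed by a lexicon score in which a degree adverb scales the next sentiment word and emoticons add their own weight. All lexicon entries, weights, thresholds and function names are hypothetical, and near-duplicate detection is not shown.

```python
import re

# Every entry, weight and threshold below is an illustrative assumption,
# not a value taken from the thesis.
URL_RE = re.compile(r"https?://\S+")
SPAM_WORDS = {"优惠", "促销", "代购"}            # hypothetical high-frequency spam words
SENTIMENT = {"喜欢": 1.0, "开心": 1.0, "讨厌": -1.0, "失望": -1.0}
DEGREE = {"非常": 2.0, "很": 1.5, "有点": 0.5}    # degree adverbs scale the next sentiment word
EMOTICON = {"[哈哈]": 1.0, "[泪]": -1.0}          # bracketed emoticon codes


def is_spam(text, max_urls=1, min_cjk_rate=0.3, max_spam_words=2):
    """Flag a post by URL count, share of Chinese characters, or spam-word hits."""
    urls = URL_RE.findall(text)
    body = URL_RE.sub("", text)
    cjk = sum(1 for ch in body if "\u4e00" <= ch <= "\u9fff")
    cjk_rate = cjk / len(body) if body else 0.0
    hits = sum(1 for word in SPAM_WORDS if word in body)
    return len(urls) > max_urls or cjk_rate < min_cjk_rate or hits >= max_spam_words


def score(text):
    """Sum lexicon weights, scanning left to right with longest-match lookup."""
    vocab = sorted({**SENTIMENT, **DEGREE, **EMOTICON}, key=len, reverse=True)
    total, scale, i = 0.0, 1.0, 0
    while i < len(text):
        match = next((w for w in vocab if text.startswith(w, i)), None)
        if match is None:
            i += 1
            continue
        if match in DEGREE:
            scale = DEGREE[match]            # remembered until the next sentiment word
        elif match in SENTIMENT:
            total += scale * SENTIMENT[match]
            scale = 1.0
        else:
            total += EMOTICON[match]
        i += len(match)
    return total


def orientation(text):
    """Map the score to a coarse sentiment orientation label."""
    s = score(text)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"


if __name__ == "__main__":
    posts = ["我非常喜欢这个手机[哈哈]", "有点失望[泪]", "优惠促销 http://a.cn http://b.cn"]
    for post in posts:
        if not is_spam(post):
            print(post, orientation(post))
```

A real system would more likely use a proper Chinese word segmenter than the longest-match scan used here, and would tune the thresholds on labeled data.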
Keywords/Search Tags: Microblog, Data Cleaning, Sentiment Lexicon, Sentiment Orientation, Public Opinion Analysis, Opinion Mining