
The Topic Mining Based On The Web Comments

Posted on: 2015-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: X F Shen
Full Text: PDF
GTID: 2268330428964529
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the mobile Internet, we are gradually entering the era of Big Data. More and more people post their viewpoints, moods, and other information on online platforms. Comments therefore contain a great deal of information. For example, product reviews can be analyzed to determine whether consumers like a product. The comments on a video can reflect reviewers' attitudes toward the video and their emotions. The comments on a blog article can show the reviewers' views of the article as well as their feelings toward the blogger. By analyzing comments, we can learn the market situation of a target product, how hot public opinion forms, and people's consumption habits. Such analysis has important theoretical and practical significance for individuals, enterprises, and governments.

A novel method for extracting valuable comments based on Chinese word similarity is proposed in this paper. Unlike newspapers, magazines, and other media, comments are short texts whose syntax is not rigorous and which contain cyber words, spoken language, and irregular phrasing. These problems make the information in comments more complex. Therefore, to extract the information in reviews more accurately, the comments need to be preprocessed before classification. We divide comments into valuable comments, emotional comments, and comment spam. First, simple rules are used to filter comment spam. Second, a method for extracting valuable comments based on semantic similarity in HowNet is applied. Third, the semantic orientation of the comments is calculated in a positive space and a negative space using statistical learning methods, so as to realize mixed orientation judgment of the comments.

We propose topic clustering based on the LDA topic model. A challenge of topic mining on news comments is that each comment comes from a different person, and each person has an individual style. Worse, each comment is relatively brief and carries little information, and comments contain spelling problems and much Internet vocabulary while covering extensive information about a single event, all of which poses a tougher challenge to researchers. Each comment can be seen as a short document, because it expresses a reviewer's attitude toward a certain topic from a certain angle, and the comments are not correlated with one another. Nevertheless, the comments all concern the same topic, so we can cluster them, allowing users to easily see other users' views on all aspects of an event. In this paper, we propose topic clustering based on the LDA topic model to generate event themes, and we use Wikipedia concepts to describe the feature words of comments in order to more effectively reduce the impact of the drawbacks of short text in a real-world environment. In addition, because the K-Means algorithm is sensitive to outliers, we cluster the comments with K-Medoids instead of K-Means, the algorithm used in most earlier papers.

Starting from the urgency of mining information from text comments, we analyze the language features of reviews to filter comment spam, study short-text similarity, cluster the topics of the effective reviews, and mine interesting knowledge from the comments. This paper expounds the necessity and rationality of this research.
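
The preference for K-Medoids over K-Means stated in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the Jaccard distance over token sets merely stands in for the Wikipedia-concept and LDA-based features, and the names `jaccard_distance` and `k_medoids`, the first-k initialization, and the sample comments are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def jaccard_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Return 1 - |A & B| / |A | B| for two token lists (assumed stand-in distance)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def k_medoids(
    items: Sequence[Sequence[str]],
    k: int,
    distance: Callable[[Sequence[str], Sequence[str]], float] = jaccard_distance,
    max_iter: int = 100,
) -> Tuple[List[int], List[int]]:
    """Cluster items around k medoids; return (medoid indices, cluster label per item)."""
    n = len(items)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")
    # Pairwise distances are all the algorithm needs; no mean vector is ever computed.
    dist = [[distance(items[i], items[j]) for j in range(n)] for i in range(n)]

    def assign(medoids: List[int]) -> List[int]:
        # Each item joins the cluster of its nearest medoid (ties go to the lower cluster id).
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = list(range(k))  # assumed deterministic start: the first k items
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total distance to its cluster.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Hypothetical tokenized comments about one product, covering two aspects.
comments = [
    ["battery", "life", "short"],
    ["battery", "drains", "fast"],
    ["battery", "life", "poor"],
    ["screen", "bright", "clear"],
    ["screen", "clear", "sharp"],
    ["screen", "bright", "sharp"],
]
medoids, labels = k_medoids(comments, k=2)
print("medoid comments:", [comments[m] for m in medoids])
print("labels:", labels)
```

Because every cluster center is an actual comment chosen by total pairwise distance, a single extreme comment cannot pull the center away from the bulk of its cluster the way it can shift a mean, and the method works with any similarity measure that yields a distance.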
Keywords/Search Tags: Topic Mining, Comment Spam, LDA, Similarity, Wikipedia