The Research On Several Problems In Social Data Mining

Posted on: 2016-09-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Wu
Full Text: PDF
GTID: 1108330503493723
Subject: Computer application technology
Abstract/Summary:
The wave of Web 2.0 has triggered the flourishing of “people-centered” applications in which ordinary web users are not only consumers but also producers of web content. For example, through microblog applications like Twitter, users can publish their current status and connect with each other by following, replying, or retweeting. With crowdsourcing services like MTurk, users can take part in manual labeling tasks (e.g., categorizing a document, translating a sentence, or marking the presence of a human face in an image) to earn financial rewards. With online stores like Amazon and eBay, users can rate products and post comments; such product reviews not only help other users decide what to buy but also provide feedback to manufacturers. With collaborative editing services like Wikipedia, users can work together to finish a huge task, such as building an encyclopedia from scratch. According to recent statistics from Alexa, 11 of the top 20 most visited websites are centered on ordinary users. We call applications that leverage social power (the contributions of ordinary users) social applications, and we call the data that users generate with them social data.

Various social applications have driven an explosion of social data. Used properly, social data can benefit a wide range of data mining and NLP tasks. For example, microblog data can help detect events and predict stock index trends; crowdsourcing data can provide manual labels for training supervised models; product reviews can provide a labeled corpus for training sentiment classification models; and Wikipedia data can contribute to the creation of knowledge bases and help realize the vision of the semantic web.

However, the following challenges prevent social data from being adopted directly in practice. First, most participants in social applications are ordinary web users who have neither passed a qualification test nor taken training classes in advance. As a result, the generated social data is error-prone. Second, some users deliberately spread rumors and spam, which introduce harmful information into social data. Third, for ease of use, social applications usually let users generate content in free or semi-structured formats that are difficult for machines to understand and process.

To tackle these challenges, we focus on three social applications and study four problems in mining social data, from coarse to fine granularity.

On the level of multiple users, we study how to improve the accuracy and efficiency of crowd sequence labeling. The main challenge is the low quality of crowd labeling. To correct errors and exclude vandalism, we present a statistical model that integrates three principles: (1) majority agreement supports the correctness of a labeling; (2) a correct labeling improves the credibility of the corresponding annotator; (3) a correct labeling increases the likelihood that other labelings sharing similar linguistic or contextual features are correct. By applying the proposed model, we can generate a unified, high-quality labeling from crowd inputs. We also extend the proposed model with active learning strategies. In this manner, the cost of crowd labeling can be reduced while accuracy is preserved.

On the level of a single user, we mine a special group of microblog users, the marionettes, who perform specific tasks to earn financial profit. Two facts motivate the purchase of marionette users. (1) To increase their follower counts and fake popularity, some users pay marionette users to follow them. For celebrities, a large number of followers demonstrates social impact and strengthens their position in advertising contract negotiations. For ordinary users, a relatively large number of followers suggests rich social connections and raises their standing in the social network. (2) To increase retweet counts: on many microblog platforms (e.g., Sina Weibo), the retweet count is the key metric for selecting top stories, so some merchants are willing to buy retweets to promote their messages for commercial purposes. Fabricated follower and retweet counts not only mislead ordinary users but also seriously impair microblog-based applications. We propose to detect marionette users with two types of discriminative information: (1) individual users' tweeting behaviors and (2) the social interactions among users. By integrating both types of information into a semi-supervised probabilistic model, we can effectively distinguish marionette users from normal ones.

On the level of a post from a single user, we mine a special group of microblog posts, the soft ads, which popular microbloggers publish to deliver advertising information in exchange for payment. Since popular microbloggers have tens of millions of followers, a single message from them can reach an audience of tens of millions. Unlike display and search ads, soft ads are typically disguised as normal tweets. Inexperienced users may mistake soft ads for normal tweets and take them as genuine recommendations from popular microbloggers. Even for users who can tell soft ads from normal tweets, the mixture of the two degrades the user experience. Besides these negative effects on users, soft ads can also reduce the income of microblog operators, because advertisers bypass the operators and negotiate directly with popular microbloggers. To protect users and platforms from soft ads, we propose to apply a constrained co-clustering model that considers both structural and textual features. The proposed approach not only alleviates the content heterogeneity problem but also groups soft ads and their corresponding owners at the same time.

On the level of a pattern within a post, we mine sentiment discriminant patterns from product reviews in the form of itemsets. The main challenge in discovering these patterns is combinatorial explosion, since pattern enumeration is NP-hard. We therefore propose ISbFIM, an Iterative Sampling based Frequent Itemset Mining method. Rather than processing the entire data set at once, ISbFIM samples computationally manageable subsets and extracts frequent patterns from these subsets. By repeating this process a sufficient number of times, we can guarantee both theoretically and empirically that the frequent patterns are enumerated without running into a combinatorial explosion. ISbFIM can be easily parallelized and applied to mine itemsets, sequences, or structures. We implement a MapReduce version of ISbFIM to demonstrate its scalability and adaptability on cross-domain and cross-language product reviews.
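The iterative-sampling idea behind the frequent itemset miner can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm: the abstract does not specify the sampling scheme, the per-sample miner, or how results are merged, so the uniform sampling of transactions, the brute-force miner capped at `max_size`, the final verification pass over the full data, and all function and parameter names below are assumptions made for illustration only.

```python
import random
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force miner intended only for small samples (hypothetical helper).

    Counts every itemset of up to max_size items and keeps those whose
    relative support within the given transactions reaches min_support.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for k in range(1, max_size + 1):
            counts.update(combinations(unique, k))
    threshold = min_support * len(transactions)
    return {itemset for itemset, n in counts.items() if n >= threshold}


def iterative_sampling_fim(transactions, min_support, sample_size=50,
                           rounds=20, max_size=3, seed=0):
    """Collect candidates from repeated small samples, then verify them.

    Assumption: samples are drawn uniformly without replacement from the
    transactions, and a final pass over the full data removes candidates
    that were frequent only by chance in some sample.
    """
    rng = random.Random(seed)
    candidates = set()
    for _ in range(rounds):
        sample = rng.sample(transactions, min(sample_size, len(transactions)))
        candidates |= frequent_itemsets(sample, min_support, max_size)

    # Verification pass: compute the true support of each candidate.
    total = len(transactions)
    as_sets = [set(t) for t in transactions]
    result = {}
    for itemset in candidates:
        support = sum(1 for s in as_sets if s.issuperset(itemset)) / total
        if support >= min_support:
            result[itemset] = support
    return result


if __name__ == "__main__":
    # Toy "reviews" already reduced to bags of words (made-up data).
    reviews = [["good", "battery"]] * 80 + [["bad", "screen"]] * 20
    for itemset, support in sorted(iterative_sampling_fim(reviews, 0.3).items()):
        print(itemset, support)
```

Each round only touches `sample_size` transactions, so the rounds are independent of one another and could be distributed across workers, which is consistent with the abstract's remark that the method parallelizes easily. The verification pass guarantees that nothing infrequent is reported, but this sketch can still miss a truly frequent itemset if no sample happens to surface it.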
Keywords/Search Tags: Social Data; Crowd Sequence Labeling; Microblog Marionette User; Microblog Soft Ads; Sentiment Discriminative Pattern Extraction