
Measuring Content Quality in User Generated Content Systems: a Machine Learning Approach

Posted on: 2012-11-22
Degree: Ph.D.
Type: Thesis
University: University of California, Irvine
Candidate: Javanmardi, Sara
Full Text: PDF
GTID: 2458390011455125
Subject: Information Technology
Abstract/Summary:
User Generated Content (UGC) has radically transformed the Web from its humble origins as a document-publishing platform. Currently, and most likely in the foreseeable future as well, the Web serves primarily as a social medium, a largely unmoderated platform where millions of people share experiences and knowledge from their own points of view. While this freedom is empowering in general, when left unguided, the Web becomes a cacophony of voices in which fact and fiction, and good information and deception, blur. When faced with poor quality content, users are left with the feeling that nothing on the Web can be trusted.

To tackle this issue of trust in unmoderated publishing media, I focus my work on Wikipedia. I set out to devise an efficient mechanism for automatically detecting low quality contributions, commonly known as "vandalism", and, at the same time, detecting contributors who systematically behave as vandals. First, I mine the Wikipedia history pages to extract user edit patterns and use these patterns to derive several computational models of a user's reputation. Second, based on these models, I generate several new user reputation features and show that they are strong predictors for locating low quality content. To improve the accuracy of my approach, I extend the feature set with additional textual features. I describe a method for detecting vandalism that is more accurate than others previously developed.

Because of the high turnaround in user generated content systems, it is important for vandalism detection tools to be scalable and to run in real time. I explain how the system can be implemented in a distributed way. In addition, I use cost-sensitive feature selection to reduce the total computational cost of executing the models.

This work is a starting point, but it will prove to be one of great importance if it contributes to a better understanding of user generated content and the methods of measuring and ensuring its quality. The methods I use in this thesis are general and can be applied to numerous other UGC systems such as Facebook and Twitter.
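
As a rough illustration of how a user reputation feature might be derived from edit history, the following minimal sketch scores each user by the smoothed fraction of their edits that were not reverted. This is not the thesis's reputation model: the function name `reputation_scores`, the `(user, was_reverted)` input format, the use of reverts as the quality signal, and the smoothing priors are all assumptions made for the example.

```python
from collections import defaultdict


def reputation_scores(edits, prior_good=1.0, prior_bad=1.0):
    """Return a reputation in (0, 1) for each user.

    edits: iterable of (user, was_reverted) pairs (hypothetical format).
    prior_good / prior_bad: assumed smoothing counts, so that users with
    very few edits are not scored as exactly 0 or 1.
    """
    kept = defaultdict(int)      # edits that survived
    reverted = defaultdict(int)  # edits that were later reverted

    for user, was_reverted in edits:
        if was_reverted:
            reverted[user] += 1
        else:
            kept[user] += 1

    users = set(kept) | set(reverted)
    return {
        u: (kept[u] + prior_good)
        / (kept[u] + reverted[u] + prior_good + prior_bad)
        for u in users
    }


# Toy edit history, invented for illustration only.
history = [
    ("alice", False), ("alice", False), ("alice", False),
    ("bob", True), ("bob", True), ("bob", False),
]
scores = reputation_scores(history)
```

A score of this kind could then be passed to a classifier as one feature alongside textual features of the edit; the abstract does not specify which features or classifier the thesis actually uses.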
Keywords/Search Tags:User generated content, Quality, Web