
Research On Online Communities Oriented User Information Mining And Its Applications

Posted on: 2015-03-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Liu
GTID: 1268330422492484
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the development of various online communities, a huge amount of user information has accumulated on the web, including user account information (e.g., usernames), user demographic information (e.g., gender, age, and location), user social relations (e.g., friend relations and reply relations), and user-generated content. On one hand, this information can help enterprises better understand their clients and target new clients more accurately. On the other hand, it can be used to build better personalized information systems. Additionally, it can help sociologists better understand human behavior. Hence, technologies for mining user information from online communities are key to building new social applications and to understanding human behavior.

However, mining user information from online communities poses several challenges: the unstructured data challenge, the cross-community challenge, and the no-measurement challenge. The unstructured data challenge means that user information in online communities is presented on web pages in an unstructured way; the diversity and dynamics of web page layouts make it difficult to automatically extract user information as structured data. The cross-community challenge means that different aspects of user information are distributed across different online communities, which makes it difficult to fully understand all aspects of a user. The no-measurement challenge means that there is no explicit measurement of user characteristics (e.g., user influence levels and user expertise levels), which makes it difficult to apply the user information directly. This paper focuses on addressing these three challenges and explores applications of the user information.
Specifically, the main contents of this paper can be summarized as follows:

(1) To address the unstructured data challenge, this paper studies the problem of extracting usernames from web pages containing user-generated content. It proposes a weakly supervised learning approach that uses a small number of statistically rare usernames to automatically collect and label large-scale training data, which addresses the limitation of previous work that requires manually labeled training data. The proposed approach relies only on single-page features, which addresses the limitation of previous work that requires multiple-page features. The experimental results show that the proposed approach significantly outperforms the state-of-the-art approach with single-page features, and performs comparably to the state-of-the-art approach with multiple-page features.

(2) To address the cross-community challenge, this paper studies the problem of linking users across multiple online communities. We divide this problem into two tasks: (a) the alias-disambiguation task, which is to differentiate users who share the same username; and (b) the alias-conflation task, which is to find all the different usernames used by one natural person. In this paper, we focus on the alias-disambiguation task. We start by quantitatively analyzing the importance of the alias-disambiguation step through a survey and an experimental analysis on a dataset from About.me. To the best of our knowledge, this is the first study to quantify human behavior in the usage of usernames. We then demonstrate an approach that automatically creates a training dataset by leveraging the n-gram probability of a username. We verify the effectiveness of this approach using a dataset from Yahoo! Answers.
This approach addresses the limitation of previous work that requires manually labeled training data. Additionally, we verify the effectiveness of the classifiers trained with the automatically generated training data.

(3) To address the no-measurement challenge, this paper studies the problem of estimating user expertise scores as an example of measuring user characteristics. Specifically, it considers the problem of estimating the relative expertise scores of users in community question answering services. This paper proposes a competition-based method that casts the estimation of user expertise scores as the estimation of the relative skill levels of players in two-player games. Compared with link-analysis-based approaches, the proposed method models the question-answer relation and answer quality information simultaneously in a unified way. Compared with answer-quality-based approaches, the proposed method considers the difficulty levels of different competitions, rather than weighting all questions equally. The experimental results show that the proposed competition-based model significantly outperforms the link-analysis-based and answer-quality-based approaches on the dataset of active users.

(4) Taking an application viewpoint, this paper studies the problem of estimating the difficulty levels of crowdsourcing tasks based on structured, linked, and measured user information. Specifically, it studies the problem of estimating question (i.e., crowdsourcing task) difficulty levels in community question answering services. This paper proposes a user-competition-based approach to estimating question difficulty levels by leveraging the measurement of user expertise levels. This measurement helps address the limitation of previous work that cannot handle partial-order observations. The experimental results show the effectiveness of the proposed model.
Finally, this paper studies the problem of calibrating question difficulty scores across communities by leveraging linked user information.

In conclusion, this paper not only addresses the unstructured data challenge, the cross-community challenge, and the no-measurement challenge, but also studies an application of structured, linked, and measured user information: estimating the difficulty levels of crowdsourcing tasks. This research has achieved some preliminary results, and we hope it will be helpful to other researchers in this area. We believe that the development of user information mining technologies will help build new social applications and support social science research.
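
As a rough illustration of the competition-based idea in item (3), the sketch below treats each question thread as a set of two-player games and updates user scores with an Elo-style rule. The abstract does not name the skill-rating model or the rules for forming matches, so everything here is an assumption for illustration: the Elo update is a stand-in, the pairing rule (the best answerer beats the asker and every other answerer) is assumed, and the names `estimate_expertise`, `thread_to_matches`, `K`, and `BASE` are hypothetical.

```python
from collections import defaultdict

K = 32.0       # assumed update step size (not from the abstract)
BASE = 1500.0  # assumed initial score for every user


def expected_win(r_winner, r_loser):
    """Elo logistic model: probability that the first player wins."""
    return 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))


def thread_to_matches(asker, best_answerer, other_answerers):
    """Turn one question thread into (winner, loser) pairs.

    Assumed pairing rule: the best answerer beats the asker and
    beats every other answerer in the thread.
    """
    opponents = [asker] + list(other_answerers)
    return [(best_answerer, u) for u in opponents if u != best_answerer]


def estimate_expertise(threads, k=K, base=BASE):
    """Replay all pairwise competitions once and return user scores."""
    ratings = defaultdict(lambda: base)
    for asker, best, others in threads:
        for winner, loser in thread_to_matches(asker, best, others):
            p = expected_win(ratings[winner], ratings[loser])
            # Beating a stronger opponent (small p) yields a larger gain,
            # so competitions are not all weighted equally.
            delta = k * (1.0 - p)
            ratings[winner] += delta
            ratings[loser] -= delta
    return dict(ratings)


# Toy threads: (asker, best answerer, other answerers); made-up data.
threads = [
    ("alice", "bob", ["carol"]),
    ("carol", "bob", []),
    ("alice", "carol", ["dave"]),
]
scores = estimate_expertise(threads)
for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {score:.1f}")
```

On this toy data, the user who wins every competition ends up ranked highest. Because the size of each update depends on the opponent's current score, the sketch reflects the abstract's point that competitions of different difficulty should not count equally.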
Keywords/Search Tags: online communities, user information, username extraction, user linking, user expertise estimation, crowdsourcing task difficulty estimation