
Research On Online Social Network User Classification And Sampling

Posted on: 2014-05-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Ceng
Full Text: PDF
GTID: 1268330425968619
Subject: Computer software and theory
Abstract/Summary:
In recent years, with the advent of more and more online services, a large amount of interrelated data has emerged on the Internet. These data include user interaction information from online social networks, citation information from paper retrieval libraries, and consumer comments from e-commerce sites. There is also a great deal of interrelated data generated outside the Internet. For example, in the biomedical domain there are data on combinations of genes and proteins, and in the telecommunications industry there are user communication data. All of these networked data share the following salient features: a lack of independence between data samples, probabilistic correlations between data attributes, and an enormous data size.

Based on these characteristics of networked data, the discussion centers on the classification problem for interrelated networked data. A series of studies was carried out, covering online data collection, modeling, feature extraction, classification, and applications. The main research content and innovations are as follows:

1. First, the dissertation discusses how to create a universal model for networked data and introduces a classification framework for networked data based on this model. The framework consists of three parts: the Local Classifier (LC), the Relational Classifier (RC), and the Collective Inference model (CI). There are algorithms corresponding to each part. The study compares the classification performance of all these algorithms as well as their combinations.

Networked data commonly contain missing data, especially for the data types that make up only a small proportion of the samples. To solve this problem, the misclassification cost is introduced into the sample weights to optimize the initialization of the networked data.
As a result, the missing data of the small-proportion data types can be estimated during the local classification stage, providing more background knowledge for the subsequent relational classification and collective inference processes.

2. For the online social network user classification problem, the sample distribution of the test datasets usually differs from that of the training datasets. To solve this problem, a transfer learning method is introduced into the Naive Bayes algorithm to transfer information from the test data into the training datasets.

3. The second chapter discusses which factors affect the prediction accuracy of the relational classifier. Most relational classifiers are based on homophily, which is a common feature of all networked data. However, most existing indicators measure only the overall homophily of the entire network, whereas the networked data classification problem requires measuring the homophily of each category within the network.

Thus, several homophily indicators are defined for any given category to quantify the homophily of networked data: Edge-centered indexes, Node-centered indexes, and the E-Index. Experiments show that the E-Index performs best. The study also found that, in relational data classification, the prediction accuracy for a given category is related only to that category's own homophily and not to the homophily of other categories.

4. In the third chapter, a crawler system based on ranked user sampling is designed specifically for Twitter. The chapter details the system framework and its resource optimization strategy.

To find the most influential users in real time, the work focuses on optimizing the user-sampling module: a retweeting rate factor p is introduced into the TunkRank algorithm.
This factor is adjusted according to users' interaction status, which allows the system to rank users in real time based on their current interaction state. Experiments show that, with the retweeting rate factor p, TunkRank keeps up with updates to users' information better than the traditional ranking algorithms PageRank and HITS.

5. Finally, based on the above work, the dissertation studies two important user relationships on Twitter: the Follow relationship and the retweet relationship. It compares their effects in two respects: propagating user influence and improving user classification accuracy.

To this end, two variables, V_f and V_r, are defined to measure the ability of each relationship to spread user influence. The experimental comparison shows that the retweet relationship plays a greater role in propagating user influence.

In addition, classification experiments were conducted separately on the Follow relationship and the retweet relationship. The Follow relationship turned out to be more helpful in classifying Twitter users, but the retweet relationship also helps in understanding users' interaction behavior. The study also found that users belonging to different categories exhibit different interaction behavior.
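
The retweet-weighted ranking described in item 4 can be illustrated with a minimal sketch. It follows the commonly cited TunkRank recurrence, Influence(X) = sum over followers Y of (1 + p * Influence(Y)) / |Following(Y)|, and treats p as a per-follower value so that it could vary with each user's interaction status. The function name, the dictionary-based graph layout, and the way p is supplied are assumptions made for illustration; the abstract does not specify how the dissertation modulates p.

```python
def tunkrank(followers, following_count, p, iterations=50):
    """Estimate TunkRank influence by fixed-point iteration.

    followers:       dict mapping each user to the list of users following them
    following_count: dict mapping each user to how many accounts they follow
    p:               dict mapping each user to an assumed retweet probability
                     (hypothetical per-user factor, not the thesis's definition)
    """
    influence = {user: 0.0 for user in followers}
    for _ in range(iterations):
        updated = {}
        for user, user_followers in followers.items():
            total = 0.0
            for f in user_followers:
                # Follower f passes on attention with probability p[f],
                # split evenly across everyone f follows.
                total += (1.0 + p[f] * influence[f]) / following_count[f]
            updated[user] = total
        influence = updated
    return influence


# Toy graph: bob and carol follow alice; carol also follows bob.
followers = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}
following_count = {"alice": 0, "bob": 1, "carol": 2}
p = {"alice": 0.5, "bob": 0.5, "carol": 0.5}

scores = tunkrank(followers, following_count, p)
print(sorted(scores, key=scores.get, reverse=True))  # ['alice', 'bob', 'carol']
```

Because p is looked up per follower, lowering p for an inactive user reduces the influence they pass upstream without touching the graph itself, which is one plausible reading of how a ranking could track current interaction state.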
Keywords/Search Tags: relational data classification, homophily, ranking algorithm, data sampling, social network analysis