
Research And Application Of Cross Domain Identification Method For Multi-source Heterogeneous Data

Posted on: 2017-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: X W Guo
GTID: 2308330485485007
Subject: Computer software and theory
Abstract/Summary:
With the advent of mobile networks and big data, users take part in more and more online activities. Their actions are increasingly recorded digitally, so the data types and data sources across the network keep growing richer. Although user modeling based on a single data set already has mature approaches, the growth of multi-source heterogeneous data brings bigger challenges. A single data set can hardly depict the diverse characteristics of a user in full, which makes it difficult to build an accurate user model. Research on multi-source heterogeneous data will create more value. Cross-domain association over multi-source heterogeneous data has therefore become a new research trend.

At present, most cross-domain association studies compute the similarity between two accounts to judge whether they belong to the same person. However, these traditional methods do not scale to large data sets. To address these problems, the thesis first studies user modeling technology in social networks, then establishes a cross-domain association model, and finally studies cross-domain association methods for large-scale data. The specific work of this thesis is as follows:

Firstly, the thesis establishes the cross-domain association model. It analyzes the forms that multi-source heterogeneous data take in social networks and proposes user feature vectors based on user behavior patterns, divided into four aspects: user profiles, user-generated content, user behavior, and network relationships. On this basis, the thesis proposes a novel cross-domain association model based on one-to-one matching. The model obtains user similarity by unsupervised methods, treats it as the edge weight, and then uses a matching algorithm to obtain one-to-one matching results.
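The idea of treating similarity as an edge weight and then extracting a one-to-one matching can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the matching algorithm, so everything here is an assumption: Jaccard similarity over tag sets stands in for the unsupervised similarity, and a greedy pass over edges sorted by weight stands in for the matching step. The names `jaccard`, `greedy_one_to_one`, and `min_sim` are hypothetical.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity between two tag collections (assumed similarity measure)."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_one_to_one(users_a, users_b, min_sim=0.1):
    """Greedy one-to-one matching between accounts of two domains.

    users_a / users_b map an account id to its tags. Every cross-domain
    pair becomes a weighted edge; edges are taken from heaviest to
    lightest, skipping any account that is already matched.
    This is a greedy approximation, not an optimal assignment.
    """
    edges = [
        (jaccard(tags_a, tags_b), id_a, id_b)
        for id_a, tags_a in users_a.items()
        for id_b, tags_b in users_b.items()
    ]
    edges.sort(key=lambda edge: edge[0], reverse=True)

    used_a, used_b, matches = set(), set(), {}
    for sim, id_a, id_b in edges:
        if sim < min_sim:
            break  # remaining edges are even weaker
        if id_a in used_a or id_b in used_b:
            continue  # enforce the one-to-one constraint
        matches[id_a] = id_b
        used_a.add(id_a)
        used_b.add(id_b)
    return matches


# Toy data: two accounts per domain, invented for illustration only.
domain_a = {"a1": {"music", "travel", "python"}, "a2": {"football", "cooking"}}
domain_b = {"b1": {"cooking", "football", "news"}, "b2": {"python", "travel", "photo"}}
matches = greedy_one_to_one(domain_a, domain_b)
```

On this toy data the sketch pairs `a1` with `b2` and `a2` with `b1`. An optimal assignment method such as the Hungarian algorithm could replace the greedy pass when the globally best total weight matters more than simplicity.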
Experimental results show that the matching algorithm outperforms the simple supervised and unsupervised methods, with an F1-score of up to 90%.

Secondly, the thesis establishes cross-domain association models for large-scale data. Two models are proposed. The first, based on simhash, maps high-dimensional user tags onto a low-dimensional space and then uses the one-to-one matching model to associate users. The second, based on an inverted index, selects potentially similar users as a candidate set and then associates the users. Experiments show that the simhash model outperforms the inverted index model, with an F1-score of up to 89%, which means the simhash-based method improves time and space efficiency without loss of accuracy.

The above work addresses cross-domain association problems on large-scale data. Future work will study cross-domain association methods based on semantic tags and apply the simhash-based model to distributed environments.
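As a rough illustration of how simhash compresses a high-dimensional tag set into a compact fingerprint, the following sketch computes an unweighted 64-bit simhash and compares fingerprints by Hamming distance. It is not the thesis implementation: the choice of MD5 as the per-tag hash, the 64-bit width, the absence of tag weights, and the function names `simhash` and `hamming` are all assumptions.

```python
import hashlib


def simhash(tags, bits=64):
    """Unweighted simhash fingerprint of a collection of string tags.

    Each tag is hashed to `bits` bits (MD5 prefix, an assumed choice).
    For every bit position, tags vote +1 if their hash has the bit set
    and -1 otherwise; the fingerprint keeps the bits with a positive sum.
    """
    votes = [0] * bits
    for tag in tags:
        digest = hashlib.md5(tag.encode("utf-8")).digest()
        tag_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (tag_hash >> i) & 1 else -1

    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(x, y):
    """Number of differing bits between two fingerprints."""
    return bin(x ^ y).count("1")


# Toy tag sets, invented for illustration only.
fp_a = simhash(["python", "travel", "music", "photo"])
fp_b = simhash(["photo", "music", "travel", "python"])  # same tags, new order
fp_c = simhash(["football", "cooking", "news"])
distance_ac = hamming(fp_a, fp_c)
```

Because the votes are summed per bit position, the fingerprint does not depend on tag order, so `fp_a` and `fp_b` are identical. Accounts whose fingerprints lie within a small Hamming distance could then be passed to the one-to-one matching step as candidate pairs.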
Keywords/Search Tags:Cross-domain Identification, User Modeling, Multi-source Heterogeneous Data, Simhash, Inverted Index