| User identification is essentially a de-anonymization problem. In a realistic identification task, the goal is generally to find, among a group of known (non-anonymous) users, the one most similar to an anonymous user based on that user's behavior data. Behavior data are the traces of operations that users leave in various network and communication services; they often contain users' own behavior patterns and reflect their preferences and habits in service consumption. From a user's behavior data we can derive a behavior pattern, and we can then identify an anonymous user by matching behavior patterns. In this paper we focus on the general identification problem and explore methods common to various identification scenarios. First, we study an identification method based on feature distribution histograms and then introduce the correlation of behavior features in the time dimension. On this basis, we propose an identification method based on feature sequences. In this method, all feature sequences of a user's behavior features along the timeline are first obtained with the n-gram model; then a set of feature sequences, ordered by popularity according to each sequence's TF (term frequency) value, is constructed as the representation of the user's behavior pattern; finally, we propose a matching method for ordered sets, which matches the anonymous user's feature sequence set against those of the known users and selects the known user with the highest matching similarity as the identification result. The above methods are experimentally verified in three different realistic scenarios, and some common problems in the identification task are discussed. First, experiments show that in all three scenarios the accuracy of the feature sequence-based method is never lower than that of the classical feature histogram method. In the user shopping and web browsing scenarios, the accuracy increases by 10% and 7%, respectively, and the time is reduced. Second, this paper examines the problem, often encountered in realistic identification tasks, of having little data for the anonymous user. On this problem, the feature sequence-based method performs better. Finally, the feature sequence-based method can be used to single out users with distinct features. Experiments show that for such users the accuracy can reach 98% on the user shopping and TV viewing data sets. We can therefore place a high degree of trust in their identification results, which is of great significance in some practical applications. |
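
The three steps of the feature sequence method (n-gram extraction, ranking by term frequency, ordered-set matching) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define the ordered-set similarity, so `ordered_similarity` below is a hypothetical rank-aware score chosen only for illustration, and the function names, the `top_k` cutoff, and the toy event labels are all assumptions.

```python
from collections import Counter


def ngram_sequences(events, n=2):
    """Return all contiguous n-grams over a time-ordered list of behavior features."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]


def ranked_profile(events, n=2, top_k=10):
    """Ordered list of the top_k most frequent n-grams.

    Raw counts are used for ranking; dividing by the total number of
    n-grams (normalized TF) would give the same order for a single user.
    """
    counts = Counter(ngram_sequences(events, n))
    # Descending count, then lexicographic order so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def ordered_similarity(profile_a, profile_b):
    """Hypothetical rank-aware similarity between two ordered profiles.

    Each shared n-gram contributes more when its ranks in the two
    profiles are close. This is an assumed stand-in for the paper's
    ordered-set matching method, which the abstract does not specify.
    """
    rank_b = {gram: rank for rank, gram in enumerate(profile_b)}
    score = 0.0
    for rank_a, gram in enumerate(profile_a):
        if gram in rank_b:
            score += 1.0 / (1 + abs(rank_a - rank_b[gram]))
    denom = max(len(profile_a), len(profile_b))
    return score / denom if denom else 0.0


def identify(anonymous_events, known_users, n=2, top_k=10):
    """Return the known user whose profile best matches the anonymous one, plus all scores."""
    anon_profile = ranked_profile(anonymous_events, n, top_k)
    scores = {
        user_id: ordered_similarity(anon_profile, ranked_profile(events, n, top_k))
        for user_id, events in known_users.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy data: each list is one user's behavior features in time order.
    known = {
        "alice": ["news", "sports", "news", "sports", "mail"],
        "bob": ["shop", "cart", "pay", "shop", "cart"],
    }
    best, all_scores = identify(["news", "sports", "news", "mail"], known)
    print(best, all_scores)
```

Keeping the profile as an ordered list rather than a plain set is what lets the similarity take rank into account; with very short anonymous traces, smaller values of `n` and `top_k` are likely to be more practical, although the abstract reports no specific settings.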