| With the rapid development of Internet technology, the Internet has had a profound impact on people’s lifestyles and brought convenience to daily life. Through the Internet, people can easily access information and communicate freely with one another. It provides new ways of learning, entertainment, communication, and sharing, and occupies an important position in people’s lives. In online communities, the concept of a network user’s virtual identity is gradually attracting attention. Nowadays, most websites and applications require users to register and log in, so people’s everyday network activity generates a large amount of data carrying virtual identity information. These massive collections of virtual accounts contain users’ personal information. Although a virtual identity is not identical to the user’s real identity, there are bound to be potential links between the two. Data analysis can therefore be applied to this massive virtual identity data to extract useful information, such as a user’s gender, age, interests, and hobbies, which gives us a deeper understanding of users. On the one hand, users can be offered a better, personalized service experience based on their network behavior; on the other hand, providers can substantially reduce the cost of pushing information by delivering targeted services. Faced with data processing at this scale, the performance of a traditional single computer can no longer meet the computing demand. 
An efficient way of processing the data is therefore needed. Apache Spark, a distributed computing system, is now widely used for massive data processing. This thesis first introduces the basic concepts of the virtual identity of network users and the current state of virtual identity data mining. Second, it introduces the architecture and operating mechanism of the Spark platform, the MapReduce programming model, and the distributed storage architecture of HDFS, and describes data storage, preprocessing, and analysis in detail. It then describes in detail the algorithms for massive virtual identity data and virtual data mining on the Spark platform. Finally, it presents the virtual data mining algorithm and its implementation on Spark, and analyzes and interprets the experimental results. |