Independent User Identification Research In The Data Management Platform Of Internet Service Provider

Posted on: 2016-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: C G Shen
Full Text: PDF
GTID: 2298330452466417
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, especially the mobile Internet, more and more users get information, shop online, and so on through the Internet, which has produced a very large population of network users. These users generate massive amounts of web log data when they surf the Internet through the ADSL devices provided by Internet Service Providers (ISPs). The ISP studied in this paper is a large company with more than four million ADSL users that collects more than 400 million web log records every day. This log data contains information about the users’ interests, such as their propensity to consume and their shopping habits. To make full use of this information, the company is building a data management platform (DMP) to collect, store, and analyze the users’ web log data, and then to target advertisements accurately based on the users’ interests.

An ADSL device is usually shared by multiple members of a family, an office, or a laboratory. In other words, many users behind one ADSL line surf the Internet using their own terminal devices, such as PCs, smartphones, and iPads. Unfortunately, the company does not know how many users are behind each line, which is a precondition for analyzing the users’ interests. In summary, independent user identification is a basic function of a data management platform.

Existing web log user identification technology is mostly used by a single website to identify the users communicating with it. User identification in a DMP, however, faces many challenges, such as numerous websites, a massive amount of web log data, and a wide range of data sources. Therefore, against the background of building a DMP from the massive web log data provided by an ISP company, we design and implement a user identification system based on MapReduce, a simple but powerful framework for parallel computation, and its open source implementation, Hadoop.

In this paper, we first introduce the requirements of the independent user identification system and its connection with the DMP. Taking into account the difficulties mentioned above, we propose a new independent user identification process with three stages: session identification, session merging, and user identification.

Then, considering the size of the web log data and the computational complexity, we analyze the three stages in detail and give specific MapReduce algorithms and their implementation code. First, we propose a session extraction algorithm based on time and referrer heuristic rules to achieve session identification. Second, we merge sessions in different ADSL data groups using three rules: 1) the approximate cookie rule, 2) the identical UUID rule, and 3) the approximate account rule. Third, we group the data using the identical account rule and merge all of each user’s session data.

Finally, we use the open source distributed platform Hadoop to develop and implement an independent user identification system. Test results on real web log data show that this system can cover about seventy percent of the data, including Taobao, Tmall, QQ, Baidu, and other major websites in descending order, and achieves the expected results on these websites’ data.
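The abstract does not give the details of the session extraction algorithm, so the following is only a minimal sketch of how a time and referrer heuristic could be expressed in a MapReduce style. It is not the thesis implementation. The `LogRecord` fields, the 30-minute timeout, and the function names are all assumptions made for illustration, and the map, shuffle, and reduce phases are simulated in a single Python process rather than run on Hadoop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional, Tuple

# Assumed inactivity threshold; the abstract does not state the value used.
SESSION_TIMEOUT_S = 30 * 60


class LogRecord(NamedTuple):
    """Hypothetical web log record; field names are illustrative only."""
    adsl_id: str             # identifier of the shared ADSL line
    timestamp: int           # request time in seconds
    url: str                 # requested URL
    referrer: Optional[str]  # referring URL, or None if absent


Session = List[LogRecord]


def map_phase(records: Iterable[LogRecord]) -> Iterator[Tuple[str, LogRecord]]:
    """Map: key every record by its ADSL line so one reducer sees one line."""
    for rec in records:
        yield rec.adsl_id, rec


def shuffle(pairs: Iterable[Tuple[str, LogRecord]]) -> Dict[str, List[LogRecord]]:
    """Stand-in for the framework's shuffle: group values by key."""
    groups: Dict[str, List[LogRecord]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sessions(records: List[LogRecord]) -> List[Session]:
    """Reduce: split one ADSL line's records into sessions.

    A record joins the most recently started session that (a) was active
    within the timeout and (b) already contains the record's referrer URL.
    Otherwise the record starts a new session.
    """
    sessions: List[Session] = []
    for rec in sorted(records, key=lambda r: r.timestamp):
        target: Optional[Session] = None
        for session in reversed(sessions):
            if rec.timestamp - session[-1].timestamp > SESSION_TIMEOUT_S:
                continue  # time heuristic: this session has expired
            if rec.referrer is not None and any(
                r.url == rec.referrer for r in session
            ):
                target = session  # referrer heuristic: page reached from here
                break
        if target is None:
            sessions.append([rec])
        else:
            target.append(rec)
    return sessions


def identify_sessions(records: Iterable[LogRecord]) -> Dict[str, List[Session]]:
    """Run the simulated map, shuffle, and reduce phases end to end."""
    grouped = shuffle(map_phase(records))
    return {adsl_id: reduce_sessions(recs) for adsl_id, recs in grouped.items()}
```

Calling `identify_sessions` on a list of `LogRecord` values returns, for each ADSL line, a list of sessions. Interleaved sessions on the same line are kept apart by the referrer check, which is one way several people sharing a line could show up as distinct browsing trails before the session merging and user identification stages.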
Keywords/Search Tags: data pre-processing, user identification, Cookie, MapReduce, Hadoop