
User Identification And Interest Analysis Of Internet Access Log Data

Posted on: 2016-12-19  Degree: Master  Type: Thesis
Country: China  Candidate: C Wei  Full Text: PDF
GTID: 2298330452966010  Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet and the growing popularity of e-commerce and electronic products, the volume of information available online is increasing rapidly. On the one hand, users must spend a lot of time finding useful information; on the other hand, information providers such as e-commerce web sites want to select, from their huge catalogs, the product information that users are interested in or need.

This thesis is based on users' access records held by an ISP. It implements personalized recommendation of product information by analyzing users' access log data and mining their interests. When a user accesses a web page, the ISP logs a record of the request; these access log records contain information about the user's access behavior.

First, the thesis studies user identification based on the K-Modes algorithm: it analyzes session identification for log records, identifies the sets of log records that belong to the same session, and then identifies users by clustering. On this basis, it proposes a semantic disambiguation algorithm based on Wikipedia and uses it for semantics-based user interest analysis, then classifies users with similar interest features, providing the basis for accurate recommendation of product information. To handle the huge number of access log records, the thesis takes advantage of MapReduce's strengths in big data processing and algorithm scalability, and designs and implements these algorithms on the Hadoop platform. The core work covers the following aspects.

The thesis studies user identification techniques. User identification is divided into four steps: data cleaning, session identification, Cookie extraction, and user identification and marking.
Data cleaning removes redundant and invalid data. Session identification uses a method that combines the referring page with a time threshold. By analyzing Cookie data, the thesis extracts the username fields commonly used when users log in to web sites. Finally, it determines six effective fields for distinguishing independent users, then applies K-Modes-based user identification to cluster the log records, identify the independent users, and assign each a serial number.

The thesis also studies user interest analysis techniques. It first analyzes the overall business process and, starting from the user's search keywords, extracts keyword features through word segmentation. Using Wikipedia as a semantic dictionary, it proposes a Chinese semantic disambiguation algorithm and studies a disambiguation technique for interest feature keywords, which yields each user's interests. Weights are determined from frequency statistics of words that share the same meaning and category. An interest feature classification database is then built, and users with similar interest features are classified according to the similarity between each category word in the database and the user's interests.

By combining independent user identification with user interest feature analysis, the thesis designs and implements the related algorithms in a Hadoop environment. It first analyzes the overall architecture of the system, which is divided into two subsystems: user identification and user feature analysis. The user identification subsystem is further divided into four submodules: data cleaning, session identification, Cookie extraction, and user identification. The user feature analysis subsystem is divided into three submodules: keyword extraction, user interest feature extraction, and mining of users with similar features.
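
The session identification step, which combines the referring page with a time threshold, might look roughly like the sketch below. The abstract does not give the exact rule, so this assumes a common heuristic: a request stays in the current session only if it arrives within the threshold and its referrer is a page already seen in that session. The `LogRecord` fields, the `split_sessions` name, and the 30-minute default are illustrative assumptions, not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    # Hypothetical record layout; the thesis does not list its log fields here.
    timestamp: float  # seconds since epoch
    url: str
    referrer: str     # empty string when the request carries no referrer


def split_sessions(records, timeout=1800.0):
    """Group one client's records into sessions (assumed heuristic).

    A new session starts when the gap to the previous request exceeds
    `timeout`, or when the referrer is not a page already in the session.
    """
    sessions = []
    current = []
    seen_urls = set()
    for rec in sorted(records, key=lambda r: r.timestamp):
        if current:
            gap = rec.timestamp - current[-1].timestamp
            linked = rec.referrer in seen_urls
            if gap > timeout or not linked:
                sessions.append(current)
                current = []
                seen_urls = set()
        current.append(rec)
        seen_urls.add(rec.url)
    if current:
        sessions.append(current)
    return sessions
```

For example, two requests ten seconds apart where the second one's referrer is the first one's URL end up in the same session, while a request arriving more than `timeout` seconds later starts a new one.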
The implementation of each module is analyzed in detail, and the core code of the key modules is given to verify the effectiveness of the techniques studied.
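
As an illustration of the clustering idea behind K-Modes-based user identification, here is a minimal single-machine sketch. It is not the thesis's Hadoop implementation: the function names are hypothetical, the initial modes are passed in explicitly to keep the example deterministic, and the three example fields (IP address, browser, username) are stand-ins, since the abstract does not name the six fields it uses.

```python
from collections import Counter


def mismatch(a, b):
    """Simple matching dissimilarity: number of fields that differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value in each column of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, init_modes, max_iter=10):
    """Assign each categorical row to its nearest mode, then update modes.

    Repeats until the modes stop changing or `max_iter` is reached.
    Returns (labels, modes).
    """
    modes = [tuple(m) for m in init_modes]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(modes)), key=lambda j: mismatch(row, modes[j]))
            for row in rows
        ]
        new_modes = []
        for j, old in enumerate(modes):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            # Keep the old mode if a cluster ends up empty.
            new_modes.append(column_modes(members) if members else old)
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes


# Hypothetical records: (ip, browser, username)
rows = [
    ("10.0.0.1", "Chrome", "alice"),
    ("10.0.0.1", "Chrome", "alice"),
    ("10.0.0.2", "Firefox", "bob"),
    ("10.0.0.2", "Firefox", "bob"),
    ("10.0.0.1", "Chrome", "bob"),
]
labels, modes = k_modes(rows, init_modes=[rows[0], rows[2]])
```

Each resulting cluster label can then play the role of the per-user serial number mentioned in the abstract. K-Modes suits this kind of data because the fields are categorical, so a mismatch count replaces the Euclidean distance used by K-Means.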
Keywords/Search Tags: Internet access log, user identification, clustering algorithm, user interest, semantic similarity, Hadoop