User Identification And Interest Analysis Of Internet Access Log Data

Posted on:2016-12-19

Degree:Master

Type:Thesis

Country:China

Candidate:C Wei

Full Text:PDF

GTID:2298330452966010

Subject:Computer software and theory

Abstract/Summary:


With the rapid development of the Internet and the popularity of e-commerce and electronic products, the volume of information on the Internet has increased rapidly. On the one hand, users need to spend a lot of time to find useful information; on the other hand, information providers such as e-commerce web sites want to select, from a huge amount of goods information, the goods that users are interested in or need, based on the users' interests.

This paper is based on users' access records held by an ISP. It implemented personalized recommendation of goods information by analyzing users' access log data and mining users' interests. When a user accesses a webpage, the ISP logs a record of that access; these user access log records contain information about the user's access behavior.

First, the paper researched user identification based on the K-Modes algorithm: it analyzed session identification for log records, identified the sets of log records that belong to the same session, and then identified users by clustering. On this basis, the paper proposed a semantic disambiguation algorithm based on Wikipedia and, using this algorithm, researched semantics-based user interest analysis. It then classified users with similar interest features, providing the basis for accurate recommendation of commodity information.

To handle the huge number of user access log records, the paper drew on the advantages of MapReduce in big data processing and algorithm scalability, and designed and implemented these algorithms on the Hadoop platform. To sum up, the paper's core work covers the following aspects.

This paper researched user identification techniques. User identification was divided into four steps: data cleaning, session identification, Cookie extraction, and user identification and marking.
Data cleaning removed redundant and invalid data. The paper then provided a session identification method that combines the referring page with a time threshold. Through analysis of Cookie data, it extracted the username fields that web sites commonly use for user login. Finally, it determined six effective fields for distinguishing independent users and applied a user identification technique based on the K-Modes algorithm, clustering the log records to identify independent users and assign each a serial number.

This paper also researched user interest analysis techniques. First, the paper analyzed the overall business process and, starting from each user's search keywords, extracted keyword features through word segmentation. Combined with a Wikipedia semantic dictionary, it provided a Chinese semantic disambiguation algorithm and studied an elimination technique for interest feature keywords, then obtained each user's interests. Weights were determined from frequency statistics of words with the same semantic meaning and classification. An interest feature classification database was created, and users with similar interest features were classified according to the similarity between each classification word in the database and the users.

By combining the study of independent user identification with user interest feature analysis, the paper designed and implemented the related algorithms in a Hadoop environment. First, the overall architecture of the system was analyzed, and the system was divided into two subsystems: user identification and user feature analysis. The user identification subsystem was further divided into four submodules: data cleaning, session identification, Cookie extraction, and user identification. The user feature analysis subsystem was divided into three submodules: keyword extraction, user interest feature extraction, and mining of users with similar features.
The implementation process of each module was analyzed in detail, and the core code of the key modules was given to verify the effectiveness of the techniques studied.
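
As an illustration of the session identification step mentioned in the abstract (referring page combined with a time threshold), the following is a minimal sketch and not the thesis's actual code. The record fields, the 30-minute default threshold, and the exact splitting rule (start a new session when the gap exceeds the threshold, or when the referrer is not a page already seen in the current session) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogRecord:
    # Hypothetical record layout; the thesis's real log fields are not given.
    timestamp: float          # seconds since some epoch
    url: str                  # requested page
    referrer: Optional[str]   # referring page, None if absent


def split_sessions(records: List[LogRecord],
                   timeout: float = 1800.0) -> List[List[LogRecord]]:
    """Group one client's records into sessions.

    Assumed rule: a record opens a new session when it arrives more than
    `timeout` seconds after the previous record, or when it carries a
    referrer that is not a page already visited in the current session.
    Records without a referrer are kept in the current session.
    """
    sessions: List[List[LogRecord]] = []
    current: List[LogRecord] = []
    seen_urls = set()

    for rec in sorted(records, key=lambda r: r.timestamp):
        if current:
            gap = rec.timestamp - current[-1].timestamp
            unknown_referrer = (rec.referrer is not None
                                and rec.referrer not in seen_urls)
            if gap > timeout or unknown_referrer:
                # Close the current session and start a fresh one.
                sessions.append(current)
                current, seen_urls = [], set()
        current.append(rec)
        seen_urls.add(rec.url)

    if current:
        sessions.append(current)
    return sessions
```

In a MapReduce setting such as the Hadoop platform named in the abstract, a function like this would plausibly run on the reduce side after records are grouped by a client key; that placement is also an assumption, not something the abstract states.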

Keywords/Search Tags:

Internet access log, user identification, clustering algorithm, users' interest, semantic similarity, Hadoop


Related items

1	Clustering Based Net User Interest Mining
2	The Study Of Clustering Web Log Based On User's Browsing Interest
3	The Study Of Web User Fuzzy Clustering Based On Path Similarity
4	An Identification Method Of Rejoining Mobile User Based On Network Access Item And Procedure
5	Research On Recommendation Algorithm Based On User Interest Partition
6	Research On Semantic-Based User Modeling And Its Applications
7	The Research On The Application Of Web Log Mining Based On User Interest And Fuzzy Clustering
8	Research On Hybrid Recommendation Algorithm Based On User Interest Change And Clustering
9	Mining Users' Mobility Behaviors And Calculating User Similarity Based On Mobile Data
10	Hybrid Recommendation Algorithm Based On Similarity Of Feature Attributes