Design And Implementation Of Web Log Mining System Based On Short Text

Posted on: 2019-03-12 | Degree: Master | Type: Thesis
Country: China | Candidate: Z H Li | Full Text: PDF
GTID: 2348330563454440 | Subject: Engineering
Abstract/Summary:
With the development of the Internet, social platforms have grown rapidly and users spend a great deal of time on them. The volume of web-based user interaction data keeps growing. How to detect highly valuable information in this mass of web data, and so provide references for users and network supervisors, has become an urgent problem. The purpose of web log mining is to discover users' behaviors and needs by analyzing web log files. The logs on a gateway server store not only users' access paths and detailed access parameters, but also the short texts that users post on social platforms. Compared with traditional log mining, mining these short texts is more valuable and can provide an important reference for public opinion analysis. Building on a comprehensive analysis of traditional log mining and short text mining techniques, this thesis analyzes and mines gateway server logs, focusing on four main problems: user session identification, session feature dimension reduction, session clustering, and short text topic clustering. The specific work is as follows:

1. User session identification. Traditional log analysis studies the access behavior of users as represented by IP addresses. However, IP addresses are now dynamically assigned, so the mapping between an IP address and a user is not fixed. Besides, a user may have different purposes when accessing the web in different time periods. Therefore, this thesis takes the user session, rather than the user, as the object of study. This gives higher discrimination and precision, setting aside the higher dimensionality it brings.

2. Session feature dimension reduction. Social platform users can add dynamic links at any time, which makes the page paths in the log high-dimensional and hard to count. Therefore, this thesis merges pages based on the similarity of their paths, effectively reducing the page dimension within a session.

3. User session clustering. A discrete user interest matrix is built from users' clicks and access times. The k-means++ algorithm is then applied and further improved using the mini-batch idea: a small part of the data is used to fit the features of the whole data set. This clustering algorithm works well on sparse, high-dimensional discrete user interest matrices, and greatly improves clustering speed at the cost of a little precision.

4. Short text topic mining. Short text is modeled with BTM, and the feature frequency of VSM is fused with the word-frequency feature to improve the accuracy of the model. At the same time, the cluster number k of the k-means algorithm is adjusted automatically based on within-class and between-class distances, which compensates BTM for the accuracy loss caused by setting the number of topics by hand.
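
The abstract does not say how sessions are delimited. As a rough illustration of session identification, the sketch below uses the common inactivity-timeout heuristic. The 30-minute gap, the record layout, and the names split_sessions and client_key are assumptions made for this example, not details taken from the thesis.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: 30 minutes of inactivity ends a session. This is a widely used
# heuristic, not a value stated in the thesis.
SESSION_GAP = timedelta(minutes=30)


def split_sessions(records, gap=SESSION_GAP):
    """Group (client_key, timestamp, path) records into sessions.

    client_key is a hypothetical identifier (for example IP plus user agent).
    Returns a list of sessions; each session is a list of (timestamp, path).
    """
    by_client = defaultdict(list)
    for client_key, ts, path in records:
        by_client[client_key].append((ts, path))

    sessions = []
    for hits in by_client.values():
        hits.sort(key=lambda h: h[0])          # order each client's hits by time
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > gap:         # long pause: close the session
                sessions.append(current)
                current = []
            current.append(hit)
        sessions.append(current)
    return sessions
```

Each returned session can then be treated as one row of the feature matrix, so the same client may contribute several rows with different purposes.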
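
The combination of k-means++ seeding with mini-batch updates can be sketched as follows. This is a generic, pure-Python version of mini-batch k-means (per-centre learning rate 1/count) with k-means++ initialisation. The thesis's own improvements are not described in the abstract, so the function names, batch size, and iteration count here are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """k-means++ seeding: pick each new centre with probability proportional
    to its squared distance from the nearest centre chosen so far."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:                  # every point sits on a centre
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


def mini_batch_kmeans(points, k, batch_size=32, n_iter=100, seed=0):
    """Mini-batch k-means: each iteration fits the centres to a small random
    sample instead of the whole data set."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    counts = [0] * k                           # points assigned to each centre
    for _ in range(n_iter):
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Assign the whole batch with the centres frozen, then update.
        nearest = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                   for p in batch]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]              # per-centre learning rate
            centers[j] = [c + eta * (x - c) for c, x in zip(centers[j], p)]
    return centers
```

Because each iteration touches only batch_size points, the cost per iteration does not grow with the size of the data set, which is where the speed gain comes from. The price is a noisier estimate of the centres.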
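
BTM (the biterm topic model) learns topics from unordered pairs of words that co-occur in the same short text, rather than from per-document word counts. The sketch below shows only this biterm extraction step, assuming the text is already tokenized. The function names are hypothetical, and the fusion with VSM features described in the abstract is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair from one tokenized short text.

    Each pair is sorted so that ("web", "log") and ("log", "web") count as
    the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def biterm_counts(docs):
    """Count biterms over a corpus given as a list of token lists."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts
```

Pooling biterms over the whole corpus is what lets the model cope with the sparsity of individual short texts.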
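
The abstract does not give the formula used to adjust k from within-class and between-class distances. One plausible criterion, shown here purely as an assumption, is the ratio of the mean point-to-centroid distance to the mean centroid-to-centroid distance, where a lower value means tighter, better separated clusters. The name within_between_ratio is hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def within_between_ratio(points, labels):
    """Mean within-class distance divided by mean between-class distance.

    Hypothetical scoring rule, not the thesis's formula. Lower is better.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    cents = {lab: centroid(ps) for lab, ps in clusters.items()}
    within = sum(math.dist(p, cents[lab])
                 for p, lab in zip(points, labels)) / len(points)

    labs = list(cents)
    pairs = [(a, b) for i, a in enumerate(labs) for b in labs[i + 1:]]
    between = sum(math.dist(cents[a], cents[b]) for a, b in pairs) / len(pairs)
    return within / between if between > 0 else math.inf
```

Under this assumption, k could be chosen by running k-means for several candidate values and keeping the one with the lowest ratio.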
Keywords/Search Tags: Web log, MB-kmeans++, Integrating BTM model, BTM precision compensation