Research And Application Of Web Log Mining Technology Based On Distributed Computing Platform

Posted on: 2019-04-01; Degree: Master; Type: Thesis
Country: China; Candidate: H X Huang; Full Text: PDF
GTID: 2348330545455609; Subject: Computer technology
Abstract/Summary:
According to the 40th Statistical Report on Internet Development in China, as of June 2017 the number of websites in China was 5.06 million, an annual increase of 4.8%. More and more enterprises are paying attention to Web log analysis and mining, hoping to extract valuable information. Web log data records users' access behavior in detail. Faced with massive log data, how to process it with big data technology and extract the valuable knowledge it contains through data mining algorithms is a research hot spot in Web mining. Based on research results on sequential patterns and clustering algorithms at home and abroad, this paper analyzes the typical Apriori algorithm and finds that Apriori generates a large number of candidate frequent paths and scans the transaction database repeatedly. An improved algorithm is therefore proposed that transforms the process of traversing multiple candidate sequences into a multi-pattern matching problem and combines the Aho-Corasick (AC) automaton with the Apriori algorithm, which greatly reduces the time overhead of the algorithm and improves the efficiency of sequence mining. For clustering, to address the weaknesses of the Fuzzy c-Means algorithm (FCM) and the Fuzzy c-Medoids algorithm (FCMdd) in handling noise and anomalous data, a new method is proposed that uses a standard S-shaped fuzzy membership function to assign fuzzy weights to user sessions and their associated URLs, so that noise and abnormal data in high-dimensional user session data can be handled. Building on FCM and FCMdd, an improved algorithm is proposed whose objective function is changed from minimizing the sum of absolute errors to minimizing the median of the absolute errors, where the error is the Euclidean distance between a user session sequence and the cluster center, so as to improve clustering quality. Finally, this paper combines Spark and other open source frameworks to design and implement a Web log mining system based on a distributed computing platform, and applies the improved algorithms in the system. The accuracy and effectiveness of the system are verified through experiments.
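The abstract does not give the details of the improved algorithm, so the following is only a minimal sketch of the general idea of replacing per-candidate scans with multi-pattern matching. It assumes that candidates are contiguous page paths, that a session is a list of page identifiers, and that support is the number of sessions containing a candidate. The function names `build_automaton` and `count_support` are hypothetical and do not come from the thesis.

```python
from collections import deque

def build_automaton(candidates):
    """Build an Aho-Corasick automaton over candidate paths (token sequences)."""
    # goto[i]: token -> child node, fail[i]: failure link,
    # out[i]: indices of candidates that end at node i (including via suffixes).
    goto, fail, out = [{}], [0], [[]]
    for idx, cand in enumerate(candidates):
        node = 0
        for tok in cand:
            if tok not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][tok] = len(goto) - 1
            node = goto[node][tok]
        out[node].append(idx)
    # Breadth-first pass to set failure links; depth-1 nodes fail to the root.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for tok, child in goto[node].items():
            f = fail[node]
            while f and tok not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(tok, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def count_support(sessions, candidates):
    """Count, for every candidate, the sessions that contain it, in one pass per session."""
    goto, fail, out = build_automaton(candidates)
    support = [0] * len(candidates)
    for session in sessions:
        node, seen = 0, set()
        for tok in session:
            while node and tok not in goto[node]:
                node = fail[node]
            node = goto[node].get(tok, 0)
            seen.update(out[node])
        # Each session contributes at most once to a candidate's support.
        for idx in seen:
            support[idx] += 1
    return support

# Hypothetical toy data: four sessions and four candidate paths.
sessions = [["A", "B", "C"], ["A", "B", "D"], ["B", "C", "A", "B"], ["C"]]
candidates = [("A", "B"), ("B", "C"), ("A", "B", "C"), ("C", "A")]
print(count_support(sessions, candidates))
```

In this sketch each session is scanned once regardless of how many candidates exist at the current Apriori level, whereas a naive support count checks every candidate against every session separately.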
Keywords/Search Tags: Web log mining, Sequential pattern mining, Clustering analysis, Distributed computing, User behavior analysis