
Research on Data Preprocessing Technology for Web Server Log Mining

Posted on: 2013-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z Li
Full Text: PDF
GTID: 2248330374486681
Subject: Communication and Information System
Abstract/Summary:
In recent years, with the rapid development of Web technology and the widespread use of the browser, the types and content of information on the Web have become very rich. The Web brings people rich information and great convenience; meanwhile, discovering valuable information effectively has become difficult for users. A new technology is urgently needed to automatically find potentially valuable information across the variety of Web resources and locations. Web log mining (also known as Web usage mining) emerged in response. Web mining of server access logs is divided into four phases: data collection, data preprocessing, pattern discovery, and pattern analysis. This thesis mainly studies data preprocessing in Web log mining.

Firstly, the background, source, significance and main content of this work are introduced, along with the existing related work.

Secondly, the basic process of data mining, its algorithms and its research significance are briefly introduced. The thesis also compares and summarizes various Web data mining methods and their applications, focuses on Web log mining, and summarizes the relevant data preprocessing techniques in Web log mining.

Thirdly, the thesis introduces the traditional Timeout method used for session identification and puts forward a new session identification method. This method integrates factors such as the user's standard browsing time, the webpage's download time and the webpage's link structure to calculate each user's actual browsing time for every webpage. Sessions can then be cleaned according to the user's interests by deleting the webpages the user is not interested in, which provides more accurate session data for later data mining.
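The traditional Timeout method mentioned above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the record layout `(user_id, timestamp, url)`, the 30-minute threshold, and the function names are assumptions. The second function is a hypothetical, much-simplified stand-in for the interest-based cleaning step: it estimates viewing time as the gap to the next request minus the page's download time and ignores link structure, which the abstract names as a factor but does not specify.

```python
from collections import defaultdict

# Assumed threshold: 30 minutes is a commonly used timeout, not a value from the thesis.
SESSION_TIMEOUT = 30 * 60  # seconds


def identify_sessions(records, timeout=SESSION_TIMEOUT):
    """Split each user's requests into sessions by the gap between requests.

    records: iterable of (user_id, timestamp_seconds, url) tuples (assumed layout).
    Returns {user_id: [[(timestamp, url), ...], ...]}.
    """
    by_user = defaultdict(list)
    for user_id, ts, url in records:
        by_user[user_id].append((ts, url))

    sessions = {}
    for user_id, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        user_sessions = [[hits[0]]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > timeout:
                user_sessions.append([cur])    # gap too long: start a new session
            else:
                user_sessions[-1].append(cur)  # same session continues
        sessions[user_id] = user_sessions
    return sessions


def filter_by_browsing_time(session, download_seconds, min_seconds=5):
    """Hypothetical cleaning step: drop pages viewed for less than min_seconds.

    Estimated viewing time = gap to the next request - download time of the page.
    The last page has no following request, so it is always kept.
    """
    kept = []
    for (ts, url), (next_ts, _) in zip(session, session[1:]):
        viewing = (next_ts - ts) - download_seconds.get(url, 0)
        if viewing >= min_seconds:
            kept.append((ts, url))
    if session:
        kept.append(session[-1])
    return kept
```

A fixed timeout treats every page the same way, which is the limitation the proposed method addresses by estimating a per-page browsing time before deciding what belongs in a session.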
The simulation results show that the modified session identification method in this thesis identifies sessions effectively.

Finally, after data preprocessing, cluster analysis can be applied to the session matrix. The thesis introduces the traditional K-means algorithm and then puts forward an improved K-means algorithm based on density and distance. The simulation results show that the modified K-means algorithm effectively improves cluster quality.

The main contributions and innovations of this thesis cover two aspects of Web mining. The first is an enhanced session identification method that integrates user interest factors. The second is a modified K-means algorithm proposed to improve the quality of session clusters.
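For reference, the traditional K-means algorithm that the thesis takes as its baseline can be sketched as below. This covers only the standard algorithm: the abstract does not detail the density-and-distance modification, so none is attempted here. The function name, the caller-supplied initial centers, and the representation of each session as a numeric tuple (for example, one row of a session matrix) are assumptions.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Traditional K-means (Lloyd's algorithm) on equal-length numeric tuples.

    points:  list of tuples, e.g. rows of a session matrix (assumed layout).
    centers: initial cluster centers chosen by the caller.
    Returns (centers, labels), where labels[i] is the cluster index of points[i].
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave a center in place if its cluster is empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels
```

The result of this baseline depends on the initial centers passed in, which is one common motivation for selecting them more carefully.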
Keywords/Search Tags: data mining, Web log mining, data preprocessing, session identification, cluster analysis