Based On The Web Server Log Mining Data Preprocessing Technology Research

Posted on:2013-04-01

Degree:Master

Type:Thesis

Country:China

Candidate:Z Li

GTID:2248330374486681

Subject:Communication and Information System

Abstract/Summary:

Recently, with the rapid development of Web technology and the widespread use of the browser, the types and content of information on the Web have become very rich. The Web brings people rich information and great convenience; meanwhile, discovering valuable information effectively has become difficult for users. A new technology is urgently needed to automatically find potential and valuable information from the variety of Web resources and locations. Web log mining (also known as Web usage mining) emerged in response to this need. Web mining on server access logs is divided into four phases: data collection, data preprocessing, pattern discovery, and pattern analysis. This thesis mainly studies data preprocessing in Web log mining.

Firstly, the background, source, significance, and main content of this work are introduced, along with existing related work.

Secondly, the basic process of data mining, its algorithms, and its research significance are briefly introduced. The thesis also compares and summarizes various methods of Web data mining and their applications, focuses on Web log mining, and summarizes the related techniques of data preprocessing in Web log mining.

Thirdly, the thesis introduces the traditional Timeout method used for session identification and puts forward a new session identification method. This method integrates factors such as the user's standard browsing time, the webpage's downloading time, and the webpage's link structure to calculate each user's actual browsing time for every webpage. The sessions can then be cleaned according to the user's interests by deleting the webpages in which the user is not interested, providing more accurate session data for later data mining. The simulation results show that the modified session identification method can identify sessions effectively.

Finally, after data preprocessing, cluster analysis can be applied to the session matrix. The thesis introduces the traditional K-means algorithm and then puts forward an improved K-means algorithm based on density and distance. The simulation results show that the modified K-means algorithm can effectively improve cluster quality.

The main contributions and innovations of this thesis cover two aspects of Web mining. The first is the enhanced session identification method, which integrates user interest factors. The second is the modified K-means algorithm, proposed to improve the quality of session clusters.
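
For readers unfamiliar with the traditional Timeout method named in the abstract, the following is a minimal sketch of one common variant: a user's requests are split into a new session whenever the gap between two consecutive requests exceeds a fixed threshold. It is not the thesis's implementation. The record layout (user, timestamp, URL), the function name identify_sessions, the 30-minute threshold, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed threshold: 30 minutes is a commonly used default,
# not a value taken from the thesis.
SESSION_TIMEOUT = timedelta(minutes=30)


def identify_sessions(records, timeout=SESSION_TIMEOUT):
    """Split each user's requests into sessions using a fixed timeout.

    records: iterable of (user_id, timestamp, url) tuples (hypothetical layout).
    Returns a list of (user_id, [url, ...]) pairs, one per session.
    """
    # Group requests by user.
    by_user = defaultdict(list)
    for user_id, timestamp, url in records:
        by_user[user_id].append((timestamp, url))

    sessions = []
    for user_id, visits in by_user.items():
        visits.sort(key=lambda visit: visit[0])  # chronological order
        current = [visits[0][1]]
        # Compare each request with the one before it.
        for (prev_time, _), (curr_time, url) in zip(visits, visits[1:]):
            if curr_time - prev_time > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((user_id, current))
                current = []
            current.append(url)
        sessions.append((user_id, current))
    return sessions


def _at(hhmm):
    """Build a timestamp on an arbitrary, made-up day."""
    return datetime.strptime("2012-01-01 " + hhmm, "%Y-%m-%d %H:%M")


# Made-up sample records, assumed to be already cleaned and user-identified.
records = [
    ("u1", _at("09:00"), "/index.html"),
    ("u1", _at("09:05"), "/products.html"),
    ("u1", _at("10:00"), "/index.html"),
    ("u2", _at("09:02"), "/about.html"),
]

for user_id, pages in identify_sessions(records):
    print(user_id, pages)
```

With the sample data, user u1's third request arrives 55 minutes after the second, so it starts a second session. A fixed threshold of this kind ignores how long a page takes to download and how pages link to each other, which are the factors the abstract says the proposed method takes into account.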

Keywords/Search Tags:

data mining, Web log mining, data preprocessing, session identification, cluster analysis

Related items

1	Research, Implementation And Application Of Data Preprocessing Algorithms In Web Log Mining
2	The Research And Implement Of Algorithm On Web Usage Mining
3	Web Mining And Its Applications
4	Research On The Web Log Mining Of Teaching Resources Searching Platform
5	Research On Technique Of Web Mining Based On Log
6	Cluster Analysis In Applied Research, Scientific Data Mining
7	Data Mining And Analyze In Freeway Charging System
8	Data Mining Application To Customers Relations Management In Security Company
9	The Application Of Cluster Analysis Algorithm In HMIS
10	Study On Crucial Techniques Of Web Usage Mining