
Research On User Session Identification And Clustering Technology Of Web Log Mining

Posted on: 2009-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: J H Zhu
Full Text: PDF
GTID: 2178360245465719
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of the Internet in volume, scale and complexity, the Web has become an effective platform on which people communicate and process information. Given the tremendous amount of information on the network, finding the information relevant to an individual user has become difficult. Web mining emerged to meet this need, and Web log mining is an important part of that research field. It applies data mining techniques to Web server logs and analyses the log files to discover the patterns in which users access sites. Web log mining consists of three processes: data preprocessing, pattern discovery and pattern analysis.

The first process in Web log mining is data preprocessing. Most real-world data is incomplete, noisy and inconsistent, and it comes in a variety of formats. Incorrect input can cause a data mining algorithm to produce faulty or inaccurate results, and mining algorithms usually expect data in a fixed format, so raw data must be converted into a form the algorithm can use. Data preprocessing has to accomplish several tasks: repairing incomplete and inconsistent data, eliminating noisy data, transforming existing data into the format the mining algorithm requires, extracting the useful data, integrating multiple data sources, and so on. Data preprocessing is a major part of the whole data mining process. Its result is the input to the mining algorithm, so it directly influences mining quality, which makes it an important research topic in Web log mining. Data preprocessing is carried out when the log files are transformed into database files.
Data preprocessing includes four phases: data cleaning, user identification, session identification and transaction identification.

This paper studies the main tasks of data preprocessing further and puts forward a new method for session identification and transaction identification in Web log preprocessing, based on users' visiting interest. The method combines parameters such as the user's download time, the user's interest in each page, the information content of each page, and the page's inbound and outbound links to calculate each user's visiting time for every Web page, and then divides sessions according to an individual threshold. After session identification, it uses the users' visiting time and their interest in pages to delete the pages users are not interested in, together with link pages, and redefines a Web transaction as a sequence of effective page visits.

Experiments show that the method can identify sessions in which users spend a long time visiting pages, and that it merges pages whose threshold is below the fixed threshold into the next session. The sessions it discovers are real sessions in a large proportion of cases and are close to the users' real visiting intentions. At the same time, the method deletes independent pages according to users' interest in pages and forms new Web transactions. This provides valuable data for clustering analysis and improves clustering efficiency.

After data preprocessing, a mining technique such as clustering or classification is selected according to the specific requirement. This paper analyses clustering techniques and the content and methods of current Web clustering. By clustering Web transactions, similar users can be found.
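
The threshold-based session splitting described above can be illustrated with a short sketch. This is not the thesis's method: the abstract does not give its formula, which also draws on download time, page interest, page information and link structure. The sketch assumes only the gaps between a user's consecutive requests are available, and it uses a hypothetical per-user threshold (mean gap plus `k` standard deviations, never below `floor` seconds). The names `user_threshold`, `identify_sessions`, `k` and `floor` are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def user_threshold(gaps, k=1.0, floor=60.0):
    """Hypothetical per-user threshold (an assumption, not the thesis formula):
    mean gap + k * population standard deviation, never below `floor` seconds."""
    if not gaps:
        return floor
    return max(floor, mean(gaps) + k * pstdev(gaps))


def identify_sessions(records, k=1.0, floor=60.0):
    """Split each user's requests into sessions.

    records: iterable of (user_id, timestamp_in_seconds, url) tuples.
    Returns {user_id: [[url, ...], ...]}, one inner list per session.
    """
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = {}
    for user, hits in by_user.items():
        hits.sort()  # chronological order
        # Time between each request and the next one by the same user.
        gaps = [b[0] - a[0] for a, b in zip(hits, hits[1:])]
        limit = user_threshold(gaps, k, floor)

        current = [hits[0][1]]
        user_sessions = [current]
        for gap, (_, url) in zip(gaps, hits[1:]):
            if gap > limit:
                # The gap exceeds this user's own threshold: start a new session.
                current = [url]
                user_sessions.append(current)
            else:
                current.append(url)
        sessions[user] = user_sessions
    return sessions


# Made-up example records: (user, seconds since start, requested page).
example = [
    ("u1", 0, "/a"), ("u1", 30, "/b"), ("u1", 60, "/c"),
    ("u1", 4000, "/d"), ("u1", 4030, "/e"),
    ("u2", 0, "/x"), ("u2", 10, "/y"),
]
print(identify_sessions(example))
```

Computing the threshold per user, instead of applying one global timeout, mirrors the "individual threshold" idea in the abstract: a slow reader and a fast clicker get different cut-off points. A fuller version would replace the raw gap with an estimated viewing time that accounts for download time and page content.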
Keywords/Search Tags: Web log mining, session identification, interest degree transaction, user cluster