Study On Crucial Techniques Of Web Usage Mining

Posted on: 2008-06-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C F Li
Full Text: PDF
GTID: 1118360272968852
Subject: Computer software and theory
Abstract/Summary:
Web usage mining aims to discover users' visiting patterns and to predict their visiting behavior by mining Web log records, so that Web-based applications can be better understood and better served. Its results typically include the common behavior and interests of user groups and the search preferences, habits, and patterns of individual users. It is therefore of both theoretical and practical importance for offering personalized and customized services, improving the structure and performance of Web systems, restructuring websites, supporting business intelligence for commercial organizations, and recommending Web pages to users.

Web content is complex, diverse, and unstructured; the organizational structure of the Web is dynamic and changeable; and Web usage data is inaccurate. These properties make Web usage mining difficult. On the one hand, traditional data mining techniques cannot be applied directly to Web data; on the other hand, this creates further challenges and opportunities for research on Web mining theories and techniques.

The quality of data preprocessing directly affects the quality of the mining results. Data for Web usage mining may come from the server side, the client side, proxy servers, site files, registration information, or remote agents. Each type of data collection differs not only in the location of the data source but also in the kinds of data available, the segment of the population from which the data was collected, and the method of implementation.

Before mining, the data must be preprocessed, a process that consists of data cleaning, user identification, session identification, and path completion. The task of data cleaning is to remove irrelevant and redundant log entries.
User identification is the process of associating page references with different users. The goal of session identification is to divide the page accesses of each user into individual sessions. Applying heuristic rules is an effective way to preprocess the data.

A Web session consists of a sequence of Web page accesses, so the similarity of Web page accesses is the basis of the similarity of Web sessions. To attract users, website managers try, as far as possible, to put similar content in similar places when designing the site structure, so the static similarity of pages can be observed from their URL structure. Meanwhile, because a difference in view time may indicate a difference in users' interest in the corresponding pages, the dynamic similarity of Web page accesses can be computed from view time. The similarity of Web page accesses is determined jointly by the static similarity based on URL structure and the dynamic similarity based on view time.

Web sessions, made up of sequences of Web page accesses, closely resemble biological sequences such as DNA (nucleotide sequences) and proteins (amino acid sequences). When the characteristics of living organisms are analyzed, the similarities between DNA or protein sequences need to be discovered; likewise, in Web usage mining, the similarities between Web sessions help us better understand and analyze users' visiting behavior. Therefore, the classic DNA and protein sequence alignment algorithms of bioinformatics can be adapted and applied to measuring the similarity between Web sessions.

The number of clusters, the initial points of the respective clusters, and the definition of the criterion function are the three key points and difficulties in Web session clustering.
WSCBSI (Web session clustering based on the increase of similarity) sets the number of clusters according to knowledge of the application domain. It uses ROCK, a clustering algorithm that produces high-quality clusters but has high space and time complexity on very large data sets, to choose the initial points of each cluster. It defines the criterion function by the contribution that assigning Web sessions to different clusters makes to the overall increase in similarity. This not only overcomes the shortcoming of traditional clustering algorithms, which consider only local similarity and therefore produce poor clustering results, but also reduces the time and space complexity of the clustering process.
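
To make the session-similarity idea described above more concrete, here is a minimal Python sketch. It is not the dissertation's method, since the abstract does not give its formulas. It assumes a static page similarity equal to the fraction of shared leading URL path segments, and it scores two sessions with a standard Needleman-Wunsch global alignment. The function names `url_similarity` and `session_similarity`, the gap penalty of -0.5, and the normalization are all hypothetical choices made for illustration. The dynamic, view-time-based component is omitted.

```python
from urllib.parse import urlparse


def url_similarity(url_a: str, url_b: str) -> float:
    """Static page similarity (assumed formula): number of shared leading
    path segments divided by the length of the longer path."""
    segs_a = [s for s in urlparse(url_a).path.split("/") if s]
    segs_b = [s for s in urlparse(url_b).path.split("/") if s]
    if not segs_a and not segs_b:
        return 1.0  # both URLs point at the site root
    common = 0
    for a, b in zip(segs_a, segs_b):
        if a != b:
            break
        common += 1
    return common / max(len(segs_a), len(segs_b))


def session_similarity(s1: list, s2: list, gap: float = -0.5) -> float:
    """Global alignment (Needleman-Wunsch) of two sessions, where each
    session is a list of URLs. Returns a value in [0, 1]."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    n, m = len(s1), len(s2)
    # score[i][j] = best alignment score of s1[:i] against s2[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Map page similarity in [0, 1] to a match score in [-1, 1].
            pair = 2.0 * url_similarity(s1[i - 1], s2[j - 1]) - 1.0
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two pages
                score[i - 1][j] + gap,       # gap in s2
                score[i][j - 1] + gap,       # gap in s1
            )
    # Normalize by the longer session and clip negative scores to 0.
    return max(0.0, score[n][m] / max(n, m))


if __name__ == "__main__":
    a = ["/news/sports/1.html", "/news/sports/2.html", "/shop/cart"]
    b = ["/news/sports/1.html", "/news/sports/3.html", "/shop/cart"]
    print(session_similarity(a, b))
```

A pairwise similarity of this kind could then feed a clustering step. Under these assumptions, two identical sessions score 1.0, and sessions that share no URL prefixes score 0.0.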
Keywords/Search Tags: Web Usage Mining, Data Preprocessing, Web Session Similarity, Web Session Clustering