
Research On Technique Of Web Mining Based On Log

Posted on: 2011-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: B Cheng
GTID: 2178360305972737
Subject: Computer software and theory
Abstract/Summary:
With the rapid development and popularization of the Internet, Web sites provide more and more information services, and their structures have become increasingly complicated. How to improve a site's structure so that users can browse it conveniently, and how to learn users' interests and hobbies so as to increase the site's profit, have become central concerns of Web businesses. To address these problems, traditional data mining technology has been introduced into the Web domain: Web logs are mined for useful information and patterns, which are then used to implement business intelligence, provide personalized services to Web users, optimize Web sites, and improve system performance. This is Web log mining. It is attracting increasing research attention because of its evident theoretical and practical significance.

This paper systematically discusses the basic theory and the complete process of Web log mining, and proposes innovations and improvements for several of its key problems.

Firstly, the paper introduces the research background, significance, data sources and main processes of Web log mining, then discusses the whole procedure of Web log data preprocessing in detail, including its difficulties and the corresponding solutions. Based on an analysis of current session identification methods and their deficiencies, a new session identification method is put forward. The new method is based on the definition of a session and on the browsing habits of Web users. It uses the site's home page and navigation pages as markers of the beginning of a new session; it avoids the deficiencies of the older methods and also lightens the work of the subsequent transaction identification step.
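
The session rule described above can be illustrated with a minimal Python sketch. It is not the thesis's PL/SQL implementation: the function name `split_sessions`, the `(user_id, timestamp, url)` record layout and the contents of `entry_pages` are all assumptions made for illustration, and the abstract does not say how edge cases (such as two entry pages requested in a row) are handled.

```python
from collections import defaultdict

def split_sessions(requests, entry_pages):
    """Split each user's requests into sessions.

    requests: iterable of (user_id, timestamp, url) tuples (hypothetical layout).
    entry_pages: set of URLs treated as home or navigation pages (assumed known).
    Returns a dict mapping user_id to a list of sessions (lists of URLs).
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = {}
    for user, hits in by_user.items():
        hits.sort(key=lambda h: h[0])  # order each user's requests by time
        user_sessions = []
        for _, url in hits:
            # A home or navigation page marks the beginning of a new session;
            # the first request of a user always opens one.
            if not user_sessions or url in entry_pages:
                user_sessions.append([])
            user_sessions[-1].append(url)
        sessions[user] = user_sessions
    return sessions
```

In this sketch a session boundary depends only on which page is requested, not on elapsed time, which is the distinguishing idea the abstract names.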
Using real Web logs, after data cleaning and user identification, the new session identification method and the older methods are implemented in PL/SQL. The experimental results demonstrate that the new method identifies more sessions, and identifies them more precisely, than the older methods.

Secondly, by analyzing the deficiencies of current methods for calculating the interest level of Web pages, an improved method is proposed, based on the access frequency and access time of a page. Analysis shows that the improved method reflects the degree of users' interest in pages more faithfully.

Lastly, every Web page is assigned a corresponding weight by the improved method. Definitions of the weight threshold and of the frequent weighting access sequence are also put forward. The GSP algorithm is then applied to mining user access sequence patterns in which pages carry these weights. The experiment shows that the frequent weighting access sequences obtained from the mining capture users' access behavior more accurately.
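
The abstract does not give the interest-level formula or the exact definition of a frequent weighting access sequence, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes that "time" means time spent on a page, that a page's weight is an equal blend (`alpha = 0.5`) of its normalized visit count and normalized total viewing time, and that a sequence's weighted support is its ordinary support multiplied by the mean weight of its pages. All function names are hypothetical, and GSP candidate generation is not shown.

```python
def page_weights(visits, alpha=0.5):
    """Assumed interest level: blend of normalized frequency and viewing time.

    visits: iterable of (url, seconds_on_page) pairs (hypothetical layout).
    Returns a dict mapping url to a weight in [0, 1].
    """
    count, dwell = {}, {}
    for url, seconds in visits:
        count[url] = count.get(url, 0) + 1
        dwell[url] = dwell.get(url, 0.0) + seconds
    if not count:
        return {}
    max_count = max(count.values())
    max_dwell = max(dwell.values()) or 1.0  # avoid division by zero
    return {
        url: alpha * count[url] / max_count + (1 - alpha) * dwell[url] / max_dwell
        for url in count
    }

def is_subsequence(pattern, session):
    """True if the pages of pattern occur in session in the same order."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)

def weighted_support(pattern, sessions, weights):
    """Assumed score: fraction of sessions containing pattern, times mean page weight."""
    support = sum(is_subsequence(pattern, s) for s in sessions) / len(sessions)
    mean_weight = sum(weights.get(p, 0.0) for p in pattern) / len(pattern)
    return support * mean_weight
```

Under these assumptions, a candidate sequence would be kept when `weighted_support(...)` meets a chosen weight threshold, so that sequences through pages of low interest are pruned even when they occur often.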
Keywords/Search Tags:Web log mining, data preprocessing, session identification, interest-level of page, frequent weighting access sequence