Research And Implementation Of Data Pre-processing Algorithms In Web Log Mining

Posted on: 2012-02-23 | Degree: Master | Type: Thesis
Country: China | Candidate: P Yang | Full Text: PDF
GTID: 2178330335960206 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, and especially the global spread of the Web, the amount of information available on the Web has grown enormously. Web mining lets us extract useful knowledge from this information: by analyzing which content users access, how they behave when visiting, and how often they visit, we can learn general patterns of user behavior and use them to improve Web services. More importantly, understanding and analyzing user characteristics can support the development of e-commerce.

Web log mining applies data mining techniques to Web access data in order to discover valuable patterns in how a Web site is visited. It is applied to personalization, Web site improvement, and business. Data preprocessing plays an essential role in Web log mining, and user identification and session identification are both basic and pivotal steps in data preprocessing. This thesis studies how to improve the accuracy of the user identification and session identification algorithms.

The thesis reviews the processes of data mining, Web data mining, and Web log mining, focuses on the techniques and process of Web log mining, and studies data preprocessing methods, including user identification and session identification. The main work of the thesis is as follows.

First, a heuristic ("inspired") rule-based user identification algorithm is presented. The algorithm uses the IP address, time information, and other fields to identify different users in the Web log. The experimental results show that the active-user-based algorithm performs much better than the basic algorithm, even for small Web logs.

Second, session identification is defined and the traditional method of a preset time threshold is optimized. The algorithm based on the new method, which computes a dynamically varying threshold on the time interval, is described in detail. The empirical analysis shows that session quality is improved.
Keywords/Search Tags:Web Log Mining, Data Pre-processing, User Identification, Inspired Rules, Session Identification, Time Threshold
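
For readers unfamiliar with the baseline, the traditional preset-threshold session identification that the abstract refers to can be sketched as follows. This is only an illustrative sketch, not the thesis's algorithm: the function name `split_sessions`, the `(user_key, timestamp, url)` record layout, and the 30-minute default are assumptions made for the example, and the dynamic threshold proposed in the thesis is not reproduced because the abstract does not specify it.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_sessions(records, timeout=timedelta(minutes=30)):
    """Group (user_key, timestamp, url) records into per-user sessions.

    Hypothetical baseline: a new session starts whenever the gap between
    two consecutive requests from the same user exceeds `timeout`.
    The 30-minute default is an assumed, commonly used value.
    """
    # Collect each user's requests; user_key could be an IP address.
    by_user = defaultdict(list)
    for user_key, ts, url in records:
        by_user[user_key].append((ts, url))

    sessions = {}
    for user_key, hits in by_user.items():
        hits.sort(key=lambda hit: hit[0])  # order requests by time
        user_sessions = []
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > timeout:
                # Gap too long: close the current session, start a new one.
                user_sessions.append(current)
                current = []
            current.append(cur)
        user_sessions.append(current)
        sessions[user_key] = user_sessions
    return sessions
```

Calling `split_sessions` on parsed log records returns a dictionary that maps each user key to a list of sessions, where each session is a time-ordered list of `(timestamp, url)` pairs. A dynamic-threshold variant would replace the fixed `timeout` comparison with a per-request value.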