The Research Of Clustering Algorithm Based On Web Log Mining

Posted on: 2012-03-09; Degree: Master; Type: Thesis
Country: China; Candidate: Y Xiao
GTID: 2218330338970829; Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, people have become increasingly dependent on it, and the amount of information online has grown along with the number of web users and web resources. People have begun to lose their way in this ocean of information: users do not know how to find the small part that interests them, businesses do not know how to improve their mode of operation, and website operators do not know how to improve their sites. These problems have given rise to a new research direction, web mining, within which web log mining is currently the most important branch. Web log mining, also called web usage mining, applies traditional data mining techniques to web logs in order to discover interesting patterns. Web log data has unique features compared with conventional data, which poses many challenges for research. Research on web mining, and on web log mining in particular, is growing steadily.

Drawing on an extensive review of the literature, this thesis introduces the basic theories of web log mining and clustering. It also proposes modified algorithms that address defects of the traditional algorithms and demonstrates them experimentally.

First, the current state of research at home and abroad is reviewed. Foreign research is more mature than domestic research, which is still largely at the theoretical stage. The theories of data mining and web log mining are then introduced, including the mining process and the features of the data being mined.

Next, the thesis focuses on the theory and current methods of data preprocessing in web log mining. Data preparation consists of data cleaning, user identification, session identification, path completion and transaction identification. An improved method for session identification is proposed.
A user access tree is used to identify sessions when the site topology is unavailable, so that transactions can be obtained without a separate path completion step. During transaction identification, pages are classified as navigation pages or content pages. A transaction is treated as invalid and deleted if all of its pages are navigation pages, which greatly reduces the size of the database without losing information.

The theory of clustering is then described in detail, definitions of page interest and user similarity are given, and the K-means algorithm is examined along with its advantages and disadvantages. A modified method is proposed to address two weaknesses of the original K-means algorithm: the selection of the initial cluster centers and the sensitivity to noisy data. The method first divides the data with a coarse initial partition and then adjusts the partition with a density-based method to obtain K from the high-density areas. This K is more reasonable than a value chosen from subjective experience, and the initial cluster centers can be chosen from the K high-density areas, which is more stable and reasonable than choosing them at random. A weight is then introduced into the calculation of the mean to reduce the impact of noisy data. The weighted centroid lies closer to the dense part of the cluster, where points are a short distance apart, which weakens the influence that isolated points far from the cluster have on the calculation of the cluster center.
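The weighted-mean idea described above can be illustrated with a minimal Python sketch. The abstract does not state the thesis's actual weighting formula, so the weight used here, w = 1 / (1 + d) with d the Euclidean distance from a point to the plain mean, is purely an assumption chosen for illustration, and the function name `weighted_centroid` is hypothetical.

```python
import math


def weighted_centroid(points):
    """Return a distance-weighted mean of equal-length numeric points.

    Assumed weighting (not taken from the thesis): w = 1 / (1 + d),
    where d is the Euclidean distance from the point to the plain mean.
    Points far from the bulk of the cluster get smaller weights.
    """
    n = len(points)
    dim = len(points[0])

    # Plain (unweighted) mean, used only as the reference for distances.
    mean = [sum(p[i] for p in points) / n for i in range(dim)]

    # Down-weight points that lie far from the plain mean.
    weights = [1.0 / (1.0 + math.dist(p, mean)) for p in points]
    total = sum(weights)

    # Weighted average per coordinate.
    return [
        sum(w * p[i] for w, p in zip(weights, points)) / total
        for i in range(dim)
    ]


# Four points in a tight group plus one isolated point.
cluster = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(weighted_centroid(cluster))
```

For this sample the plain mean is (2.4, 2.4), while the weighted centroid is pulled back toward the four tightly grouped points, which is the effect the abstract attributes to the weighting step.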
The effectiveness of the improved algorithm is then tested experimentally on standard data sets. Finally, the algorithm is applied to web log data to obtain user transaction clusters; measured by the similarity within clusters and the dissimilarity between clusters, the results are better than those of the original algorithm.

The thesis closes with a summary, a discussion of problems encountered during the study, and directions for future research.
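The transaction filtering rule stated in the abstract (a transaction is invalid if every page in it is a navigation page) can be sketched in a few lines of Python. The function name `drop_navigation_only`, the page URLs, and the representation of a transaction as a list of page paths are all assumptions made for illustration; the abstract does not say how the thesis classifies pages or stores transactions.

```python
def drop_navigation_only(transactions, navigation_pages):
    """Keep only transactions that contain at least one content page.

    transactions: list of transactions, each a list of page paths.
    navigation_pages: iterable of paths already classified as navigation
    pages (how that classification is done is outside this sketch).
    An empty transaction has no content page, so it is dropped as well.
    """
    nav = set(navigation_pages)
    return [t for t in transactions if any(page not in nav for page in t)]


# Hypothetical example data.
navigation = {"/index.html", "/news/"}
transactions = [
    ["/index.html", "/news/"],                # navigation only: dropped
    ["/index.html", "/news/article-1.html"],  # has a content page: kept
    ["/products/item-7.html"],                # content only: kept
]
print(drop_navigation_only(transactions, navigation))
```

Only the first transaction is removed, since it is the only one made up entirely of navigation pages.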
Keywords/Search Tags: Web log mining, data preprocessing, transaction identification, user similarity, clustering