
Research On Web Clustering Algorithms Based On Swarm Intelligence And Random Indexing

Posted on: 2012-01-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Wan
GTID: 1488303356472154
Subject: Signal and Information Processing
Abstract/Summary:
Clustering divides data into meaningful or useful groups (clusters) without any prior knowledge. It is a key technique in data mining and has become an important issue in many fields. As the volume of information and data in the world grows and the problems under study become more complex, existing clustering techniques face increasing challenges. The study of new clustering algorithms is therefore an important topic in research fields including data mining, machine learning, statistics, and biology.

Social insect behaviors such as finding the best food source, building an optimal nest structure, brooding, protecting larvae, and guarding show intelligence at the swarm level. Swarm Intelligence (SI) is an innovative distributed intelligent paradigm for solving optimization problems that originally took its inspiration from biological examples. It achieves artificial intelligence by simulating natural biological behaviors. Because clustering strategies are important in many fields, global optimization methods based on swarm intelligence have been applied to clustering problems. Since criterion functions for clustering are usually non-convex and nonlinear, traditional approaches, especially the k-means algorithm, are sensitive to initialization and easily trapped in local optima. As the size and dimensionality of data sets increase, finding a solution to the clustering criterion functions has become an NP-hard problem.

Users of a Web site usually exhibit various types of behavior associated with their information needs and intended tasks by clicking on or visiting Web pages. These behaviors can be traced in the Web access log files of the Web site that the user visited.
Web usage mining, which captures navigational patterns of Web users from log files, can detect and analyze the characteristics of Web users' access patterns on a Web site, and thereby identify potential customers and improve the quality of service to users. Clustering technology is a newly developed paradigm in Web usage mining and Web user behavior analysis. Current Web clustering methods are mostly based on Web sessions or Web page content, while there are relatively few approaches to clustering Web users' navigation patterns. Moreover, conventional Web usage mining techniques for analyzing user behavior capture only standalone user behaviors at the page view level; they cannot identify the intrinsic characteristics of Web user activities or quantify the underlying and unobservable factors associated with specific navigational patterns. Thus, it is necessary to develop new Web user clustering and modeling methodologies that identify the latent factors or hidden relationships in users' navigational behavior and effectively improve the performance of clustering technology.

The results produced by Web user clustering can be used in various advanced applications, for example Web prefetching and caching. Many techniques, including Web mining approaches, have been used to improve the accuracy of predicting user access patterns from Web access logs, making the prefetching of Web objects more efficient. Most of these techniques are, however, limited to predicting requests for a single user. Predicting the interests of groups of users has received little attention in the area of prefetching.

The main work of the dissertation can be summarized as follows:

(1) Most existing clustering algorithms have limitations: they are restricted to a single type of data set, fall easily into local optima during the search process, and struggle to achieve good results on high-dimensional data sets.
To overcome these drawbacks of traditional clustering techniques, and in view of the characteristics of data clustering applications, this thesis proposes a clustering algorithm (referred to as the CAS-C algorithm) that builds on the existing chaotic ant swarm (CAS) algorithm and is based on the chaotic behavior of ants. This work extends the application fields of the chaotic ant swarm algorithm. Numerical simulation experiments show that the proposed CAS-C algorithm is insensitive to the initial centers, finds a globally optimal clustering result, and is suitable for high-dimensional data and clusters of different shapes.

(2) The Bacterial Foraging (BF) algorithm is a new stochastic search technique and optimization model based on the foraging behavior of bacterial swarms. However, as a new kind of intelligent bionic algorithm, BF still has room for improvement. Algorithm improvement and parameter adjustment are important issues in current research on Bacterial Foraging optimization, and studies of clustering based on bacterial foraging behavior are especially rare. Inspired by bacterial foraging behavior, this thesis proposes a new clustering algorithm (called the BF-C algorithm) based on bacterial foraging optimization. The thesis also gives a detailed investigation and analysis of how to set the BF-C parameters in data clustering. Compared with other global optimization-based clustering techniques, the BF-C algorithm is simpler, faster, and easier to understand. However, the chemotactic step size is sensitive to changes in the environment and needs to be investigated under different condition settings.

(3) Traditional clustering methods are sensitive to the initial values and can easily become trapped in a local optimum.
To address the problems and characteristics of traditional Web user clustering techniques, this thesis applies the clustering algorithm based on chaotic ant swarm to Web log analysis and user clustering in order to discover user navigation patterns and thereby improve the performance of Web user clustering. To evaluate the proposed methodology, the clustering results of CAS-C are compared with those of two methods widely used in Web mining (the k-means algorithm and the FCMdd algorithm). A large number of numerical simulation experiments show that the proposed CAS-C approach produces more compact and well-separated clusters and can effectively identify common interests of users.

(4) When Web user access logs are analyzed and mined, Web user navigation behaviors need to be processed and formalized into a certain form. This process is generally called user modeling. Current Web user behavior analysis and clustering techniques capture only standalone user behaviors at the page view level; they cannot identify the intrinsic characteristics of Web user activities or quantify the underlying and unobservable factors associated with specific navigational patterns. We therefore propose a Web user modeling approach based on Random Indexing (RI), which segments URLs and builds index models of them using the concept of "context" from natural language processing. In this way, hidden information underlying the browsing patterns can be incorporated into the user model, which helps the Web user clustering algorithm and improves the clustering results. Clustering experiments are conducted with two kinds of user modeling techniques, the feature vector method and the Random Indexing method, to show the superiority of the RI-based user model.

(5) The results produced by our Web user clustering algorithm can be used in various advanced Web applications, such as Web caching and prefetching.
In order to evaluate the results of the proposed Web user clustering approach, we present a scheme for predicting the behavior of grouped users and prefetching Web pages. Common interests of users are summarized by creating common user profiles. Based on the results of Web user clustering, we then establish prefetch rules for user groups and place pages that users may request in the future into the cache of the Web site. To make the experimental results more convincing, our clustering and prefetching approaches are also compared with the k-means algorithm and the FCMdd algorithm. Numerical prefetching experiments show that, with the help of the RI-based Web user model, the Web user clustering technique based on CAS-C achieves higher Web page prefetching accuracy.
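
The Random Indexing user model described in item (4) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the dimensionality, the number of nonzero entries, the "/"-based URL segmentation, and all function names are assumptions chosen for illustration. Each URL segment receives a sparse random index vector, and a user's vector is the sum of the index vectors of the segments in the pages that user visited.

```python
import random
from collections import defaultdict

DIM = 64      # dimensionality of the reduced space (assumed value)
NONZERO = 6   # number of +1/-1 entries per index vector (assumed value)


def index_vector(rng, dim=DIM, nonzero=NONZERO):
    """Sparse ternary index vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    positions = rng.sample(range(dim), nonzero)
    for i, pos in enumerate(positions):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def segment_url(url):
    """Split a URL path into lowercase segments (hypothetical segmentation rule)."""
    return [part.lower() for part in url.split("/") if part]


def build_user_vectors(sessions, seed=0):
    """Sum, per user, the index vectors of all URL segments that user visited.

    `sessions` is a list of (user, url) pairs, e.g. parsed from an access log.
    """
    rng = random.Random(seed)
    index = {}                                # segment -> index vector
    users = defaultdict(lambda: [0] * DIM)    # user -> accumulated vector
    for user, url in sessions:
        for seg in segment_url(url):
            if seg not in index:
                index[seg] = index_vector(rng)
            users[user] = [a + b for a, b in zip(users[user], index[seg])]
    return dict(users)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    log = [("alice", "/shop/books"), ("bob", "/shop/music"), ("carol", "/news")]
    vectors = build_user_vectors(log)
    print(cosine(vectors["alice"], vectors["bob"]))
    print(cosine(vectors["alice"], vectors["carol"]))
```

The resulting fixed-length user vectors could then be passed to any clustering algorithm, such as k-means or a swarm-based method; users who share URL segments tend to obtain similar vectors because they accumulate the same index vectors.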
Keywords/Search Tags: Clustering analysis, Swarm intelligence, Optimization-based clustering, Web user behavior analysis