| Sequential pattern mining is a data mining technique applied to sequence databases. It aims to find relationships between sequential events and the specific ordering among them. Sequential pattern mining is an extension of association rule mining and is widely applied in customer behavior analysis, web browsing pattern analysis, analysis of scientific experiments, early diagnosis of diseases, forecasting of natural disasters, DNA sequence analysis, and so on. Although there have been great advances in the research and application of sequential pattern mining techniques, key issues remain, e.g., high algorithmic complexity, low efficiency on large-scale datasets, and poor adaptability. This dissertation focuses on sequential pattern mining methods and their application in web usage mining, using data mining methods and genetic algorithm theory. The main contributions of the dissertation are summarized as follows. First, data mining concepts and various data mining techniques for different types of data are presented, and the development of data mining is reviewed. The clustering technique is introduced in detail, including its basic theory, algorithms, and process. Second, since the k-means algorithm is sensitive to noise and outliers, is easily trapped in local optima, and in particular requires the number of clusters to be specified a priori, the Genetic k-medoids algorithm (GKMD) is presented to overcome these drawbacks. The GKMD treats the number of clusters as a variable in the fitness function. The chromosome encodes the number of clusters together with the positions of the medoids, and corresponding crossover and mutation operators are designed. The GKMD algorithm can therefore determine the optimal number of clusters during the evolution process. In addition to the global search capability of the GA, the GKMD algorithm uses effective heuristic search methods to enhance its local search ability.
Experiments show that the GKMD algorithm performs robustly on datasets with noise and outliers, and can both determine the optimal number of clusters and obtain higher clustering accuracy. Third, a novel two-stage scheme for mining sequential patterns is proposed. In the first stage, the sequences are clustered into several groups. An n-tuple data structure is designed to represent sequences and reduce the dimensionality. A more understandable and accurate method for measuring similarities among these sequences is presented. The new similarity measure, SMCS, captures more specific information about sequences, so that the similarity is computed more accurately. In the second stage, stratograms are employed to visualize the patterns. A stratogram provides additional information, such as the frequency of the sequences, which helps discover and extract significant patterns. Fourth, the proposed sequential pattern mining method is extended and applied to web usage mining. An ontology-based representation for web sessions is proposed, and the corresponding semantic web session clustering and visualization method is presented. A new similarity measure for semantic web sessions, called SMSCP, is defined on the semantic common paths of users’ navigation. Various factors related to web users’ interests are included in SMSCP. The web sessions are clustered using the improved k-medoids algorithm and the single-link hierarchical algorithm separately. Stratograms are employed to visualize the clustering results. The validity of the similarity measure is verified by comparison with other similarity measures on a specific dataset. The experimental results, visualized as stratograms, also validate the effectiveness of the proposed similarity measure. The knowledge extracted from the stratograms can be used to make navigation recommendations for users or to help site designers optimize the web site structure. |