Research on Distributed Sequential Pattern Mining Algorithms

Posted on: 2009-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: C H Zhang
GTID: 2178360242493652
Subject: Computer application technology
Abstract/Summary:
In the current information-dominated era, massive amounts of data are stored in databases and data warehouses. Faced with this "information explosion", extracting valuable information from massive data has become particularly important. Data mining techniques emerged and developed to address this problem: they use a large number of analytical tools to extract potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random real-world data.

This thesis focuses on sequential pattern mining. Sequential pattern mining is an important research topic in data mining: it finds frequent patterns that are related in time, and so addresses the limitation that association rules do not reflect the chronological order of events. It has been applied widely, in customer purchasing behavior analysis, network access pattern analysis, the analysis of scientific experiments, early disease diagnosis, natural disaster forecasting, and DNA sequence pattern analysis.

After studying the existing methods for mining sequential patterns, the thesis investigates multidimensional sequential pattern mining, distributed sequential pattern mining, and approximate mining of distributed multidimensional sequential patterns. The main contributions and innovations of this dissertation are as follows:

1) Traditional sequence mining methods cannot effectively address the high cost of generating candidate sequences, so a new bitmap-based mining method, SMBR (Sequential patterns Mining based on Bitmap Representation), is proposed. The method represents the database with bitmaps and introduces a simplified bitmap structure.
The algorithm first generates candidate sequences by SE (Sequence Extension) and IE (Item Extension), then obtains all frequent sequences by comparing the original bitmap with the extended item bitmap, and uses an efficient strategy to compute frequency counts quickly, yielding the multidimensional sequential patterns.

2) Some existing distributed sequential pattern mining algorithms generate too many candidate sequences, which increases communication overhead. This thesis proposes a new algorithm for distributed systems, FMGSP (Fast Mining of Global Sequential Patterns). Its idea is to compress the local frequent sequential patterns into a simple lexicographic sequence tree, which avoids transmitting repeated prefixes. Based on the regular, simple sequences of the merged trees, a new pruning method, I/S-E (Item Extension and Sequence Extension) pruning, is presented to prune candidate sequences effectively. Communication overhead is thereby greatly reduced, and global sequential patterns are generated efficiently. Theoretical analysis and experiments show that FMGSP has superior performance and is effective for mining global sequential patterns from huge amounts of data.

3) A distributed approximate mining algorithm for multidimensional sequential patterns, AMSP (Approximate Mining of Global Multidimensional Sequential Patterns), is presented to solve the problem of mining multidimensional sequential patterns in large databases in a distributed environment. First, the multidimensional information is embedded into the corresponding sequences, which converts multidimensional sequential pattern mining into ordinary sequential pattern mining. Then the sequences are clustered, summarized, and analyzed at the distributed sites, and the local patterns are obtained by an effective approximate sequential pattern mining method.
Finally, after all the local patterns are collected at one site, the global multidimensional sequential patterns are mined using a high-vote sequential pattern model. Both theoretical analysis and experiments show that this method simplifies the problem of mining multidimensional sequential patterns and avoids mining redundant information. Because the communication cost is reduced, the method is scalable and obtains the global sequential patterns effectively.
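
The prefix-sharing idea described for FMGSP in contribution 2 can be illustrated with a small sketch. This is not the thesis's implementation: the names `Node`, `build_tree`, `merge_into`, `count_nodes`, and `to_patterns` are hypothetical, each sequence element is assumed to be a single item rather than an itemset, and adding up local supports at merge time is a simplification that ignores sites where a pattern was not locally frequent.

```python
from typing import Dict, Optional, Tuple

Pattern = Tuple[str, ...]  # assumption: one item per element, no itemsets


class Node:
    """One node of the prefix tree; the path from the root spells a pattern."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.support: Optional[int] = None  # None: the path is only a prefix


def build_tree(patterns: Dict[Pattern, int]) -> Node:
    """Insert each pattern; a shared prefix reuses the nodes already present.

    Sorting only makes the child order lexicographic.
    """
    root = Node()
    for pattern in sorted(patterns):
        node = root
        for item in pattern:
            node = node.children.setdefault(item, Node())
        node.support = patterns[pattern]
    return root


def merge_into(target: Node, source: Node) -> None:
    """Merge another site's tree into target, adding supports of equal patterns."""
    if source.support is not None:
        target.support = (target.support or 0) + source.support
    for item, child in source.children.items():
        merge_into(target.children.setdefault(item, Node()), child)


def count_nodes(node: Node) -> int:
    """Number of items stored in the tree (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def to_patterns(node: Node, prefix: Pattern = ()) -> Dict[Pattern, int]:
    """Flatten the tree back into a pattern -> support mapping."""
    out: Dict[Pattern, int] = {}
    if node.support is not None:
        out[prefix] = node.support
    for item, child in node.children.items():
        out.update(to_patterns(child, prefix + (item,)))
    return out


# Made-up local frequent patterns of two sites, with local supports.
site1 = {("a",): 5, ("a", "b"): 4, ("a", "b", "c"): 3, ("a", "c"): 2}
site2 = {("a",): 3, ("a", "b"): 2, ("b",): 4}

tree = build_tree(site1)
flat_items = sum(len(p) for p in site1)  # items sent as a flat pattern list
tree_items = count_nodes(tree)           # items sent as a prefix tree
merge_into(tree, build_tree(site2))
merged = to_patterns(tree)
```

In this toy data the four patterns of `site1` contain 8 items when listed one by one but only 4 nodes as a tree, because the prefixes `a` and `a b` are stored once. The candidate pruning and the actual communication protocol of FMGSP are not modeled here.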
Keywords/Search Tags: Data mining, distributed sequential patterns, multidimensional information, approximate mining