Font Size: a A A

Research On Algorithms For Sequential Pattern Mining Based On Concept Lattice

Posted on:2008-07-05Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y JinFull Text:PDF
GTID:1118360242960148Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Sequential pattern mining is the process of extracting frequent sequential patterns from sequential events or transactions which occurred in a specific order. Sequential pattern mining as well as association rule mining, classification and clustering are the most important data mining techniques. Sequential patterns are similar to association rules and the main difference between them is that sequential patterns indicate the correlation between transactions in a certain order while association rules represent intra transaction relationships. Sequential pattern mining is applicable in a wide range of applications such as the analysis of customer purchase behavior, web access patterns, disease treatments, DNA sequences, and so on. Concept lattice is an effective tool for concept discovery from data, having the ability to embody relationship of concepts in a vivid and concise way using Hasse diagram. Concept lattice has been widely used in information retrieval, digital library, and software engineering, knowledge discovery, etc. Concept lattice has many good properties which make it more suitable for mining frequent patterns, including the completeness of pairs in the lattice with respect to the binary relation, the independence of the order of input variables or attributes, and the convenience of combining with domain knowledge. This paper systematically introduces the basic definitions related to sequential pattern mining and concept lattice, and browses briefly some general and basic methods for sequential pattern mining and concept lattice building. Moreover, we put our emphasis on the algorithms for sequential pattern mining based on concept lattice model. We propose several novel concept-lattice-based models by adding sequential restrictions to concept lattice and making further extensions with multidimensional ordered information and unordered background information support. Based on these models, we develop a series of algorithms to deal with different types of sequential pattern mining tasks efficiently. The major contents can be summarized as follows:(1) We present a new model named ordered concept lattice and develop an incremental algorithm based on ordered concept lattice to discover user traversal patterns on the web. User traversal pattern mining is an important task of web usage mining and belongs to continuous sequential pattern mining. Meaningful user traversal patterns can be helpful to understand the behavior of end users and to improve the design of the websites. First, we derive a method to convert the original sequence of log data into a set of maximal forward references. By doing so, we can filter out the effect of some backward references which are mainly made for ease of traveling and concentrate on mining meaningful user access sequences. We then define the basic notations of ordered concept lattice such as formal context, partial relation, basic operators, and so on. The proposed ordered concept lattice is then used in discovering frequent user traversal patterns. An incremental and efficient mining algorithm is put forward, and a comparison with related works is also conducted. The proposed algorithm scans the database only once and is able to find maximal frequent patterns more directly which can avoid the generation of some unnecessary candidate patterns. Moreover, its incremental property makes it more suitable to handle with the dataset of user traversal log which usually changes frequently. Experimental results on both synthetic and real-life datasets show that it is superior to the related method AprioriAll in most situations.(2) We propose an algorithm for mining user traversal patterns from multidimensional sequence data. An n-dimensional traversal sequence is an ordered list of (n-1)-dimensional traversal sequences. A multidimensional traversal sequence is able to indicate many important traversal relations among sequential dimensions and make the resulting patterns more useful. In this paper, a compressed representation of multidimensional traversal sequences is introduced and the pair of items and dimensions is integrated into the compressed description. We also design the methods for the computation of the intersection and subset of multidimensional traversal sequences. Moreover, we modify the ordered concept lattice model by taking multidimensional traversal sequences as the intension and implement an algorithm for mining multidimensional traversal patterns. The proposed algorithm not only inherits the incremental property but also reduces the dependency on the number of dimensions. We generate several synthetic datasets and conduct a series of experiments to compare the performance of our algorithm with related works. Experimental results demonstrate the effectiveness of the new algorithm.(3) A novel data model called multidimensional concept lattice is proposed and an incremental multidimensional sequential pattern mining algorithm is developed. Multidimensional sequential pattern mining is the process of mining one or more unordered dimensions of information alongside one or more ordered dimensions of information, which integrates association rule mining and sequential pattern mining. An example of multidimensional sequence is that the students whose ages are between 20 and 29 years are likely to buy HP printers after buying IBM computers some months ago. The age and vocation dimensions provide the background information of the sequence and make it more informative. We design the multidimensional concept lattice to solve the problem efficiently. In fact, multidimensional concept lattice is the extension or generalization toward concept lattice and ordered concept lattice. The intension of multidimensional concept lattice is more informative, which is constituted of one or more ordered task dimensions and one or more unordered background dimensions. The task dimension represents the sequence of transactions ordered by time, while the background dimension tells the user more about the VIII conditions under which each pattern occurs. The structure of multidimensional concept lattice is more complicated and is applicable in a wide range of applications. Multidimensional sequential patterns are much more informative frequent patterns which are suitable for immediate use. The proposed algorithm named TDCL-SB integrates sequential pattern mining and association pattern mining with a uniform data structure and makes the mining process more efficient. Our performance study on synthetic datasets shows the scalability and effectiveness of the algorithm. Moreover, in order to validate the practicability of the approach, we use the transaction dataset of credit card customers of a commercial bank as the test dataset and some interesting patterns are obtained. We conduct another experiment using the tools of IBM Intelligent Miner on the same dataset and prove the consistency of the resulting patterns gotten by the two approaches.(4) We propose the theme of generalized multidimensional sequential pattern mining, which integrates the research on multidimensional user traversal pattern mining and multidimensional sequential pattern mining. We also develop three efficient methods for generalized multidimensional sequential pattern mining. A generalized multidimensional sequence consists of one or more ordered task sequence dimensions and one or more unordered background dimensions. The main difference between a generalized multidimensional sequence and a multidimensional sequence is that the task sequence of the latter is only 1-dimensional and the task sequence of the former is in multidimensional form. For example, an n-dimensional task sequence is an ordered list of (n-1)-dimensional task sequences. Therefore, compared with a multidimensional sequence, the structure of a generalized multidimensional sequence is more complicated and the intension is more informative. In this paper, we give the definition of generalized multidimensional sequential pattern mining and other related concepts. And three algorithms, PrefixMDSpan_Ex, PSFP_Ex and ETDCL-SB, are proposed for generalized multidimensional sequential pattern mining. PrefixMDSpan_Ex is an algorithm by embedding multidimensional background information into task sequences and thus utilizing modified PrefixMDSpan algorithm to deal with the extended task sequences. PSFP_Ex and ETDCL-SB are two methods which integrate sequential pattern mining and association pattern mining in different ways. PSFP_Ex is the extension of PSFP and ETDCL-SB is the modified version of TDCL-SB, which is an algorithm based on multidimensional concept lattice. Our extensive experiments on synthetic datasets show the advantages as well as limitations of these methods. Some recommendations on selecting proper methods are also drawn.
Keywords/Search Tags:Algorithms
PDF Full Text Request
Related items