Font Size: a A A

A Study On Algorithms Of Weighted Frequent Pattern Mining

Posted on:2009-11-02Degree:DoctorType:Dissertation
Country:ChinaCandidate:R N GengFull Text:PDF
GTID:1118360278975157Subject:Light Industry Information Technology and Engineering
Abstract/Summary:PDF Full Text Request
With the development of information technology, especially emerging of the network technology, our, abilities to collect, store and transfer data have been improved dramatically, and the great amount of data from diverse fields has been generated. Data mining technology has increasingly become an essential tool for data analysis and decision support. Frequent pattern mining is an important problem in the data mining area with a wide range of applications and mining tasks such as Web mining, retailing data mining, association rules mining, sequential pattern mining, path analysis, and so on.Traditional methods for mining frequent patterns treat each item in TDB (Transaction Database) uniformly, while each item has usually different importance in the real applications. To solve this problem, this paper studies the problem of WFPM (Weighted Frequent Pattern Mining) whose principal characteristic is to assign each item of TDB a weight to reflect its importance in the mining process. In the last decade, the FPM algorithms have achieved a more mature theoretical foundation, however the WFPM and its applications are at an early stage of development, and it is worthy of in-depth study and exploration. The downward closure property (i.e., antimonotonicity) of the unweighted support measure in the weighted case no longer exists and previous algorithms cannot be applied directly. So the main focus in the WFPM algorithms is how to let weighted support or interesting measure satisfy the downward closure property so as to prune the search space efficiently. Taking Web-based frequent traversal path mining and the data mining in retailing for examples, this dissertation sets forth the related methods about mining weighted frequent traversal pattern based on WDG (Weighted Directed Graph) and weighted highly-correlated frequent pattern thoroughly.Data mining on graph traversals has been an active research during recent years. Graph and traversals on it are widely used to model several classes of accessing data in the real world, and the most representative example is Web path accessing. The structure of Web can be simulated by a WDG, in which the vertices represent the related web pages or websites, and every directed edge between two wertices stands for the hyperlink from one vertex to another one, and the weights of vertices or edges reflect their own different importance. However, traditional traversal patterns mining algorithms hardly considered weighted traversals. To solve the problem of weighted traversal pattern mining, the dissertation makes an intensive study as follows.The dissertation generalizes the categories of WFTPM (Weighted Frequent Traversal Pattern Mining). From the user-type perspective, Web weighted frequent traversal pattern mining can be divided into two classes: one is the IWFTPM (Individual WFTPM), the other is HWFTPM (Holistic WFTPM), and from the mining methods, into common TPM and sequential TPM.In addition, this dissertation proposes the WDG model and divides the categories of WDG into two types, i.e., VWDG (Vertex-Weighted Directed Graph) and EWDG (Edge-Weighted Directed Graph). Moreover, the transformational methods between VWDG and EWDG are presented in detail, which simplifies the problem of WFTPM based on WDG traversals and improves the theoretical basis of patterns mining based on graph traversals.Based on the WDG model, the thesis accomplishes the several following contributions.For the problem of IWFTPM, the dissertation devises a novel property among the weighted traversal patterns—scalable property, and converts candidate patterns' pruning problem into corresponding patterns' scalability-checking problem. Based on the scalable property among the weighed patterns, the paper puts forward a WDG structure-based frequent traversal pattern mining algorithm named WFTPMiner (Weighted Frequent Traversal Pattern Miner), and presents two strategies to implement the algorithm—strategy EGTG (Evaluated by Global Topology of Graph ) and strategy ELTG (Evaluated by Local Topology of Graph).When the minimum support threshold is lower, algorithm WFTPMiner is inefficient to mine weighted frequent patterns. To overcome this deficiency of WFTPMiner, a revised expression of weighted support is presented. Then, based on the weighted FP-tree pattern growth method, two efficient algorithms—CWFTPMiner (Closed Weighted Frequent Traversal Pattern Miner) and WTMaxMiner (Weighted Traversal-based Maximal frequent pattern Miner) is devised, among which the former is used to mine closed WFTPs and the latter is used to mine maximal WFTPs. In addition, in the process of implementing above two algorithms, the thesis analyzes the possible information loss case resulting from the wrong order of incorporating the closure constraint with weight constraint, and brings forward the right sequence of combination between two above-mentioned constraints.By analyzing the shortcoming of weighed FP-tree pattern growth method and the continuity property of the items in the traversal path, the paper regards a traversal pattern as a sequence pattern. Based on the above fact, a WDG-based weighted sequential pattern mining algorithm named WTSPMiner (Weighted Traversal-based Sequential Pattern Miner) is presented. This algorithm adopts an improved weighted prefix-projected pattern growth approach to decompose the task of mining original sequence database with weight constraint into a series of smaller tasks of mining locally weighted projected database with a 'divide-and-conquer' strategy. The algorithm pushes the weight constraint into the mining process to accomplish the mining of weighted frequent sequence traversal patterns.To solve the problem of HWFTPM, an algorithm called SWFTPMiner (Statistical theory-based Weighted Frequent Traversal Pattern Miner) is devised. Adopting the notion of confidence level in statistical theory, the algorithm first eliminates the outliers with noisy weights from the original traversal database T to get the revised WDG G_r and traversal database T_r which reflect the most users' traversal behaviors of the holistic. Then, based on the revised G_r and T_r, two mining strategies, respectively called level-wise mining strategy and divide-and-conquer mining strategy, are presented to mine the WFTPs from T_r in detail.In addition, taking the retailing as a example, the thesis brings forward an algorithm called WHFPMiner (Weighted Highly-correlated Frequent Pattern Miner), in which a new objective interesting measure called weighted h-confidence is developed to mine weighted frequent patterns with strong correlation. Adopting the revised weighted support, we prove that the weighted h-confidence has two properties, i.e., anti-monotone property and cross weighted support property. Based on the weighted FP-tree pattern growth method, the algorithm applies two properties of weighted h-confidence to mine the weighted frequent patterns involving items with similar level of weighted support quickly and effectively.The extensive performance analysis show that suggested approaches in this dissertation are efficient and scalable, and are competent for respective task of mining weighted frequent patterns. More importantly, most algorithms presented by the paper have a broad application fields, for example, WDG-based frequent pattern mining algorithms can be applied to various real applications whose characteristic can be simulated by the WDG model expediently.
Keywords/Search Tags:data mining, weighted directed graph, frequent traversal pattern, weighted FP-tree, scalable property, closed frequent pattern, maximal frequent pattern, statistical theory, noisy data, highly-correlated pattern
PDF Full Text Request
Related items