Font Size: a A A

Constraint-Based Frequent Pattern Mining:Novel Applications And New Techniques

Posted on:2015-01-15Degree:DoctorType:Dissertation
Country:ChinaCandidate:L ZhangFull Text:PDF
GTID:1268330428984441Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Mining constrained frequent patterns has been well recognized to be funda-mental to many important data mining tasks with broad applications. However, there are still several challenges in this research area.(1) In addition to pat-tern support, how to design new pattern measures to capture users’real inter-ests in order to satisfy the needs of novel applications?(2) Different from the anti-monotone property of support, the properties of the proposed pattern mea-sures are usually very complex, in other words, they do not satisfy monotone, anti-monotone, convertible, succinct and so on. How to efficiently estimate the bounds of pattern measures for all the super-patterns of the given pattern?(3) Usually, different pattern measures are proposed motivated by different applica-tions, then different bound estimation methods are proposed separately. Does it exist a general framework that can fast estimate the bound of any pattern measure. To address the above problems and challenges, in this paper, we make a focused study of the methodologies and applications for constrained frequent pattern mining. Our contributions could be summarized as:Firstly, we propose a novel application by harnessing the wisdom of the crowds for accurate Web page clipping. Clipping Web pages, namely extract-ing the informative clips (areas) from Web pages, has many applications, such as Web Printing and e-reading on small handheld devices. Although many ex-isting methods attempt to address this task, most of them can either work only on certain types of Web pages (e.g., news-and blog-like Web pages), or perform semi-automatically where extra user efforts are required in adjusting the outputs. The problem of clipping any types of Web pages accurately in a totally automatic way remains pretty much open. To this end, in this study we harness the wis-dom of the crowds to provide accurate recommendation of informative clips on any given Web pages. Specifically, we leverage the knowledge on how previous users clip similar Web pages, and this knowledge repository can be represented as a transaction database where each transaction contains the clips selected by a user on a certain Web page. Then, we formulate a new pattern mining problem, mining top-1qualified pattern, on transaction database for this recommendation. Here, the recommendation considers not only the pattern support but also the pattern occupancy (proposed in this work). High support requires that patterns appear frequently in the database, while high occupancy requires that patterns occupy a large portion of the transactions they appear in. Thus, it leads to both precise and complete recommendation. Additionally, we explore the properties on occupancy to further prune the search space for high-efficient pattern mining. Finally, we show the effectiveness of the proposed algorithm on a human-labeled ground truth dataset consisting of2000web pages from100major Web sites, and demonstrate its efficiency on large synthetic datasets.Secondly, we give two real-world applications for the occupancy and then propose an efficient general algorithm DOFRA for mining occupancy-constrained frequent patterns. Frequent pattern mining is an important data mining problem with many broad applications. Most studies in this field use support (frequency) to measure the popularity of a pattern, namely the fraction of transactions or sequences that include the pattern in a data set. In this study, we introduce a new interesting measure, namely occupancy, to measure the completeness of a pattern in its supporting transactions or sequences. This is motivated by some real-world pattern recommendation applications where an interesting pattern should not only be frequent, but also occupies a large portion of its supporting transactions or sequences. With the definition of occupancy we call a pattern dominant if its occupancy value is above a user-specified threshold. Then, our task is to identify the qualified patterns which are both dominant and frequent. Also, we formulate the problem of mining top-k qualified patterns, that is, finding k qualified patterns with maximum values on a user-defined function of support and occupancy, e.g., weighted sum of support and occupancy. The challenge to these tasks is that the value of occupancy does not change monotonically when more items are appended to a given pattern. Therefore, we propose a general algorithm called DOFRA (DOminant and FRequent pattern mining Algorithm) for mining these qualified patterns, which explores the upper bound properties on occupancy to drastically reduce the search process. Finally, we show the effectiveness of DOFRA in two real-world applications and also demonstrate the efficiency of DOFRA on several real and large synthetic datasets.Lastly, we propose a general framework for bound estimation of any pattern measure. It has been well recognized that constrained pattern mining helps to capture more application semantics, thus increase the effectiveness of mining. One major challenge in this field is how to push the properties of the pattern measures deep into mining process to achieve better efficiency. The usual solution to this challenge is to estimate the bound of a given pattern measure PM for all supersets of an itemset X. However, most of previous studies estimate the bounds for their proposed pattern measures individually, and lack a generic and uniform framework which is applicable to any pattern measure. To this end, we revisit this problem of bound estimation and summarize the commonality of these estimation methods for different pattern measures. The basic idea of our proposed general framework for efficient bound estimation is that the pattern measure is maximized or minimized by assigning any item label to the items in the original supporting transactions. To achieve balance between bound tightness and computational efficiency, we also propose techniques to address this trade-off issue to improve the overall performance. As a case study, we apply this framework onto two typical pattern measures, i.e., utility and occupancy. Additionally, we also present how to apply our proposed techniques to other measures. Our extensive experimental evaluation on real and large synthetic datasets demonstrates the effectiveness of our proposed techniques.
Keywords/Search Tags:Occupancy, Bound Estimation, Smart Print, Frequent Pattern Min-ing, Constrained Frequent Pattern Mining
PDF Full Text Request
Related items