Algorithms For Cross-level High Utility Itemset Mining

Posted on:2022-07-06

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wang

Full Text:PDF

GTID:2518306569994819

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Data mining has become a major field of research in recent years.One of significant tasks of data mining is pattern mining,the goal of which is finding patterns meeting a set of requirements specified by the user.An emerging topic in pattern mining is high utility itemset mining(HUIM),the goal of which is to find all high utility itemsets in a quantitative transaction database.A high utility itemset(HUI)is a set of items(products)that yields a profit that is equal to or greater than a minimum utility threshold,specified by the user.Finding HUIs is useful for decision makers to understand customer behavior as it reveals the most profitable sets of products,and can also be applied to other domains.However,a major limitation of current HUIM algorithms is that they ignore information about item categories.To find HUIs involving item categories,an algorithm named ML-HUI Miner was recently designed.However,this algorithm has two major limitations.First,it can only find HUIs containing items that are all from the same taxonomy level.Thus,ML-HUI Miner misses all profitable itemsets containing items from different taxonomy levels.Second,ML-HUI Miner uses a simple approach to find multi-level HUIs as it mines each taxonomy level separately rather than using the relationships between categories to perform search space pruning.There exist algorithms for discovering relationships between items from different taxonomy levels for the task of frequent itemset mining.However,these techniques have not been generalized for the more difficult problem of HUIM.For this reason,a new pattern type called cross-level high utility itemset is defined in this study.The properties of this pattern are carefully studied and the following theoretical results are obtained.The search space for cross-level high utility itemset mining is much larger than that of traditional problem and is also complex to explore because each item can be generalized or specialized to create more itemsets.In order to be able to efficiently find all cross-level HUIs,two kinds of extensions have been defined and designed so that all potential patterns can be systematically explored without having to access them more than once.The first type of extensions is called a join-based extension.For any itemset X,the join-based extension of X is appending item y to X,where y ? AI,y(?)i,?i ? X,y ? Desc(i,?).The second type of extensions is called a tax-based extension.For any itemset X,the tax-based extension of X is replacing the right-most item y with a descendant of y.Based on these two types of extensions,a novel search space is defined.It can be represented as a set-enumeration tree.In order to avoid repeatedly scanning the database,a vertical data structure called tax-utility-list is proposed,which extends the utility-list in HUI-Miner with category information.The tax-utility-list for itemset X contains a tuple<TID,iutil,rutil>,where TID denotes a transaction containing the itemset X,iutil denotes the utility of X in TID,and rutil represents the remaining utility of X in TID.In addition,the taxutility-list contains the childs field,which is a set of pointers to the tax-utility-list of X's specializations.According to the previously proposed search space exploration operation,all crosslevel HUIs can be found by exploring all extensions of the items and calculating their utility.While this approach is feasible,it is inefficient because the search space is very large for many real-life databases.To design an efficient algorithm,two novel strategies for pruning the search space to find all patterns without having to compute the utility for all possible itemsets are proposed.The first strategy is called Generalized-Weighted-Utility(GWU)measure and defined as the sum of the transaction utilities containing the itemset.This measure is a monotone upper bound on the utility.Therefore,for any itemset X,if the GWU(X)is less than the minimum utility threshold,then X as well as all its extensions are low utility itemsets.The second strategy is called REmaining-Utility(REU)measure and defined as the sum of utilities and remaining utilities in the tax-utility-list of an itemset.This measure is also an upper bound on the utility.Therefore,for any itemset X,if the REU(X)is less than the minimum utility threshold,then X and all its tax-based extensions are low utility itemsets.Based on the tax-utility-list structure and search space pruning strategies proposed above,the CLH-Miner algorithm is designed to mine all cross-level HUIs.It takes a transaction database D,category information ?,and a minimum utility threshold minutil as input,and outputs all cross-level HUIs.The performance of the algorithm was evaluated on four datasets(Foodmart,Liquor,Fruithut,and BMS).The results show that the proposed GWU pruning strategy can significantly improve the execution time of the algorithm.In addition,the algorithm also finds patterns that cannot be found by traditional HUIM algorithms.The experimental results also revealed that the number of patterns mined is usually large,which is not conductive to presenting and providing decision support to the user.To address this limitation,two compact representations of cross-level HUI are proposed,namely the closed cross-level HUI and the maximal cross-level HUI.An itemset X is called closed cross-level HUI if it has no proper superset Y such that sup(X)=sup(Y)and it has no proper specialization Z that sup(X)=sup(Z)and u(X)? minutil.An itemset X is called a maximal cross-level HUI if it has no proper superset Y in a database such that u(Y)?minutil,and it has no proper specialization Z such that u(Z)?minutil.Based on these two new representations,the CLH-Minerclosed algorithm and the CLH-Minermaximal algorithm are presented in this study.Compared to the CLH-Miner algorithm,the inputs for the two algorithms are the same,while the two algorithms add a CheckClosed procedure and a CheckMaximal procedure,respectively.While the above algorithms can find interesting patterns,a problem is that it is not intuitive for the user to set the minimum utility threshold by themselves,which significantly affects the number of patterns found and the performance of the algorithm.The user often has to run an algorithm several times to find a suitable threshold value in order to get enough patterns.Therefore,a third compact representation called top-k cross-level HUI is proposed.Mining the top-k cross-level HUI means identifying the k patterns with the highest utility in the database.An algorithm called TKC is proposed to allow the user to directly set the number of itemsets k that they want to find.The inputs of the TKC algorithm are the transaction database D,the category information ?,and the number of patterns to be found k,and the ouput are the k itemsets with the highest utility.The TKC algorithm was compared with the CLH-Miner algorithm.CLH-Miner was run with minutil set to the optimal case in order to obtain the same k patterns found by TKC.The results indicate that the difference in execution time between the two algorithms is small,which is relatively impressive considering that the top-k problem is more complex.In summary,a new pattern,namely cross-level high utility itemset,is defined.Novel search space and data structure are proposed to address this problem.In order to mine this pattern efficiently,a depth-first algorithm based on two pruning strategies is proposed.In addition,three compact representations of this pattern are proposed and algorithms are designed for each representation.Finally,quantitative experiments are performed on real datasets to show the effect of different parameters and pruning strategies.Analysis of the mining results shows that the proposed patterns correctly interpret the data and help to further support decision-making.

Keywords/Search Tags:

data mining, pattern mining, high utility itemset, taxonomy, cross-level

PDF Full Text Request

Related items

1	Research On Novel Methods In Utility Pattern Mining
2	Research On Key Technologies Of High Utility Itemset Mining
3	Research On High-utility Itemset Data Minging Based On Distributed Platform
4	Research On High Utility Pattern Mining Method For Big Data
5	Research On High Utility Itemsets Mining Algorithm
6	Research Of High Frequent-utility Itemset Mining
7	Multi-Relational Frequent Pattern Mining Algorithm And Its Application Research
8	Research On Accelerating High-utility Itemset Mining Based On Spark
9	Design And Implementation Of High Utility Itemset Mining Algorithm On Massive Data
10	Improvement And Application Research Of High Utility Pattern Mining Algorithm Over Data Stream