Font Size: a A A

Research On Frequent Pattern Mining Methods For Uncertain Data

Posted on:2017-01-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:X M YuFull Text:PDF
GTID:1318330512452195Subject:Management of engineering and industrial engineering
Abstract/Summary:PDF Full Text Request
With the advent of the era of big data,data mining technologies are confronted with enormous challenges and unprecedented opportunities.As one of the key issues in the research area of data mining,the problem of frequent pattern mining has been extensively studied in the past 20 years,which has prompted the emergence of many classical theories,effective algorithms and emerging applications.As computationally the most expensive step and major task in association rules discovery,frequent pattern mining has found its way into varied applications,such as marketing management,text mining,public health and other fields.However,the data obtained from real-life applications are inherently noisy or uncertain.The uncertainty in data may be caused by limited technologies and methods,errors of the measurement equipments,communication overhead or needs of individual privacy protection.Furthermore,subjected to the constraints of subjective and objective conditions,a series of uncertainty may appear during data mining,which continuously spreads and accummlates in the process of mining,and results in a huge difference between the mined knowledge and the reallistic situations,and even leads to its insignificance and uselessness in practical applications.However,the traditional frequent pattern mining algorithms do not take more consideration of the imprecise characteristics in uncertain data,and hold the view that the knowledge obtained is generally useful and certain,which leads to an embarrassment of trying to interpret the abnormal results discovered in uncertain data.Obviously,it is inappropriate and unscientific.Therefore,frequent pattern mining in uncertain data becomes an important but challenging problem which attracts increasing attention in data mining research community.In this paper,two typical kinds of uncertain data are considered,which are probabilistic data and error-tolerant data.Our researches focus on probabilistic frequent pattern mining and approximate frequent pattern mining in imprecise environments,then we concentrate on effective and efficient algorithms to address the problem of frequent patterns mining in Traditional Chinese Medcine datasets.The main contributions of this paper are summarized as follows:1.Using Eclat framework,we present a frequent itemset mining algorithm(UBEclat algorithm)in uncertain data with vertical representation.The UBEclat algorithm mines all probabilistic frequent itemsets exactly under the Possible World Semantics.Firstly,a bidirectional ordering strategy is put forward with the purpose of improving the efficiency of frequent item set mining algorithms in probabilistic data.Then based on the definition of probabilistic frequency,the UBEclat algorithm is proposed in a divide-and-conquer way.The experimental results on both real and synthetic datasets show its rationality and validity.Therefore,a new method is provided to settle the problem of mining probabilistic frequent itemsets exactly in uncertain environments.2.The task of computing probabilistic frequentnesses of all itemsets is time consuming in uncertain environments.To tackle the problem of costly computation,we improve the UBEclat algorithm,and propose an efficient approximate mining approach to handle probabilistic data,which is named NDUEclat algorithm.Integrating the Eclat-based frequent itemset mining framework and Normal distribution-based approximation,the NDUEclat algorithm acquires a win-win partnership in large sparse probabilistic data.The performance of NDUEclat algorithm on both real and synthetic datasets,especially on a widely accepted click stream dataset following Zipf distribution illustrates its rationality and feasibility at solving the issue of probabilistic frequent itemset mining in uncertain mobile environments.To the best of our knowledge,the NDUEclat algorithm is the first approximate mining algorithm that concentrates on discovering probabilistic frequent itemsets in probabilistic data with vertical representation.3.Since the problem of mining error-tolerant frequent patterns in imprecise envirments is proved to be NP-hardness,most of the endeavours focus on efficient approximate algorithms to solve such a NP-hard problem.In this paper,we explore a novel approach which is based on the theory of rough set(RST),and develop a RST-based method to mine approximate frequent patterns in imprecise environments.With a transactional information system constructed on the data set under consideration,a transactional decision table is described,then lower and upper approximations of support are available which can be easily computed from the indiscernibility relations.Finally,by a divide-and-conquer way,the approximate frequent patterns are discovered taking consideration of support-based accuracy and coverage.The extensive evaluations of the RST-based method is exerted on both synthetic datasets and real applications.The experimental results show that the RST-based method outperforms the comparative ones in performance of precision.The novel method opens up a new way to improve the quality of approximate frequent patterns mining.4.In order to release users from the burden of setting subtle parameters and to provide better exibility in real-life applications,we propose a novel model of mining top-k approximate frequent closed patterns in imprecise environments,which consists of support-based clustering as data preprocessing,core pattern generation based on rough set theory and approximate frequent closed pattern mining by a divide-and-conquer way.Experimental results in an imprecise Traditional Chinese Medicine dataset show its compactness in expressing the approximate frequent patterns and its effectiveness in the Traditional Chinese Medicine applications.All in all,this paper mainly focuses on frequent pattern mining problems in uncertain environments,including exact mining methods in data with vertical respresentation,approximate mining algorithms in probabilistic data and approximate frequent pattern mining in error-tolerant data.Corresponding solutions are proposed from several different points of views considering the characteristic of data under consideration,the data representation,the format of mining and the choice of mining technology.The experimental results have verified their effectiveness.Our work can provide helpful information for frequent pattern mining in real applications,especially in the field of Traditional Chinese Medicine.
Keywords/Search Tags:anti-monotonicity property, frequent pattern, uncertain, probabilistic data, error-tolerant data
PDF Full Text Request
Related items