
A PROV-based Process Analysis Method For Improving Interpretability Of Data Mining Results

Posted on: 2017-05-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Ke
Full Text: PDF
GTID: 1368330542966599
Subject: Computer software and theory
Abstract/Summary:
Nowadays, many enterprises emphasize the ability to make decisions based on data. As a typical data analysis technique, data mining plays an increasingly important role in enterprise decision making, since it can find useful information hidden in massive data. However, the interpretability of data mining results (that is, models and patterns) remains far from satisfactory, even though many such results have been discovered from data. Because decision makers do not participate in the data mining process, the process is a black box to them, and they do not know how the results were produced. When decision makers receive these results, they may ask: What data is the model based on? How was the data processed? How was this model selected over other models? At present, these questions cannot be completely answered, which makes it difficult for decision makers to understand and trust mining results.

Interpreting the mining process is an effective way to improve the interpretability of mining results. According to the CRISP-DM process standard, the data mining process is iterative and cannot be completed in a single pass. To discover satisfactory results, technical staff repeatedly modify data mining workflows and run them to produce corresponding process instances, which describe how a dataset was transformed into mining results. Existing methods explain each process instance separately but lack the ability to interpret the whole iterative process. First, they cannot explain how data mining workflows evolve, so decision makers do not know the relationships and differences between process instances and have no basis for comparing them. Second, they do not provide business background information (BBI), so decision makers do not know the business meaning of data elements. Third, they lack the interactivity needed to support comparison of process instances across different dimensions and granularities.

Data provenance, a kind of metadata that describes the history of a piece of data, helps users better understand generated resources. It can therefore serve as the basis for describing and explaining mining processes. PROV is the latest standard in the area of data provenance and describes the whole life cycle of resources in a unified way. Based on PROV, the whole iterative mining process can be described clearly and questions about mining processes can be answered easily. Because it reveals the history of generated resources, PROV is widely used in many areas. In the scientific workflow area, data provenance has been used to reproduce computation processes and help scientists verify final results. In the Web environment, PROV has been used to describe how information propagates and help users evaluate its credibility.

This thesis proposes a PROV-based method for explaining data mining processes. Its aim is to improve the interpretability of mining results by better supporting the iterative nature of data mining stated by CRISP-DM. Specifically, this thesis completes the following work:

(1) Proposes a provenance model named PROV-WD to model the provenance information produced during mining processes. PROV-WD describes the evolution of two kinds of resources, data mining workflows and datasets, which addresses the inability of existing methods to explain how data mining workflows evolve. It refines the elements and relations of PROV for the specific setting of iterative mining processes and builds a mechanism for describing the evolution of both kinds of resources. Compared with current models, which use PROV to describe the evolution of only one resource, PROV-WD describes the evolution of two kinds of resources at the same time, which is a new type of application of PROV.

(2) Proposes a provenance model that integrates BBI with the description of data mining processes. The model extends PROV to describe containment relationships and business semantic relationships among business concepts, so that PROV can describe business relationships as well as influence relationships. It also constructs a mapping between BBI and data elements, so that BBI can be integrated into the interpretation of data mining processes. By adding a description of BBI to provenance information, the model makes provenance information more useful for interpreting processes. The integration mechanism can also serve as a reference for applying provenance information in other domains.

(3) Proposes a multi-dimensional data model for provenance information, named the provenance cube (PC), to support comparison of process instances across different dimensions and granularities. PC is intended to improve interactivity in the interpretation of data mining processes. It borrows the idea of multi-dimensional modeling from OLAP (Online Analytical Processing) and builds a storage pattern following the "fact-dimension" structure. In addition, this thesis proposes several operations on PC, including dice, slice, drill down, and roll up. Compared with current methods, PC-based analysis supports comparing process instances across different dimensions more intuitively and easily.

(4) Based on the RapidMiner Studio data mining platform, proposes a design for implementing the prototype system DMAnalyzer, with the goal of verifying the feasibility of the method. Specifically, this thesis presents detailed designs for three key modules: capture of provenance information, management of provenance information, and a query and analysis engine for mining processes. It also presents a case analysis to verify the effectiveness of the method.

To solve the problems of existing methods, this thesis proposes a new method for analyzing data mining processes. The method provides a basis for comparing different process instances by interpreting how data mining workflows evolve, provides business semantic explanations for data elements, and increases interactivity in the analysis of data mining processes through PC and its storage pattern. With these improvements, this thesis better supports the iterative nature of data mining processes, which is emphasized by CRISP-DM, in the interpretation of those processes.
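To make the four cube operations named in the abstract concrete, the following is a minimal sketch of slice, dice, roll up, and drill down over a small "fact-dimension" table of provenance records. It is an illustration only and not the storage pattern or implementation of the thesis: the dimension names (`instance`, `workflow_version`, `stage`, `operator`), the measure `duration_ms`, and all sample values are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical fact table: one row per operator execution in a process instance.
# Dimension names, the measure, and all values are invented for illustration.
FACTS = [
    {"instance": "run1", "workflow_version": "v1", "stage": "preprocessing",
     "operator": "Normalize", "duration_ms": 120},
    {"instance": "run1", "workflow_version": "v1", "stage": "modeling",
     "operator": "DecisionTree", "duration_ms": 300},
    {"instance": "run2", "workflow_version": "v2", "stage": "preprocessing",
     "operator": "Normalize", "duration_ms": 110},
    {"instance": "run2", "workflow_version": "v2", "stage": "preprocessing",
     "operator": "FilterOutliers", "duration_ms": 40},
    {"instance": "run2", "workflow_version": "v2", "stage": "modeling",
     "operator": "DecisionTree", "duration_ms": 280},
]


def slice_cube(facts, dim, value):
    """Slice: fix a single dimension to a single value."""
    return [f for f in facts if f[dim] == value]


def dice_cube(facts, **allowed):
    """Dice: keep rows whose value on each named dimension is in the allowed set."""
    return [f for f in facts
            if all(f[dim] in values for dim, values in allowed.items())]


def aggregate(facts, dims, measure):
    """Sum a measure grouped by the given dimensions.

    Grouping by fewer dimensions is a roll up; adding a dimension to the
    grouping is a drill down.
    """
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f[measure]
    return dict(totals)


# Roll up: total duration per workflow version.
by_version = aggregate(FACTS, ("workflow_version",), "duration_ms")

# Drill down: split each workflow version by stage.
by_version_stage = aggregate(FACTS, ("workflow_version", "stage"), "duration_ms")

# Slice: only the modeling stage, across all instances.
modeling_only = slice_cube(FACTS, "stage", "modeling")

# Dice: preprocessing steps of workflow version v2 only.
v2_preprocessing = dice_cube(FACTS, workflow_version={"v2"}, stage={"preprocessing"})
```

In this sketch, comparing `by_version` with `by_version_stage` shows how two process instances can be compared first at a coarse granularity and then at a finer one, which is the kind of interaction the provenance cube is described as supporting.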
Keywords/Search Tags: Data Mining Process, Interpretability, Data Provenance, PROV, Multi-dimensional Analysis