
A Study on Interpretability in the Data Mining Process

Posted on: 2019-04-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W J Quan
Full Text: PDF
GTID: 1368330596458555
Subject: Computer Science and Technology
Abstract/Summary:
Data mining is a nontrivial process that reveals hidden, previously unknown, and potentially valuable information in large volumes of data. With the rapid development of information technology, many sectors, such as commerce, enterprises, research institutions, and government, have accumulated huge amounts of data stored in various forms, and these data often contain useful information, so data mining is widely applied. In the data mining process, machine learning algorithms are usually used to build models. In machine learning and data mining, the interpretability of models has great theoretical and practical value: an interpretable model is more trustworthy and more likely to be adopted by users. Interpretability research has developed for more than two decades and has produced rich results. However, current research still has shortcomings, such as insufficient consideration of human cognitive factors and a lack of studies on the interpretability of unsupervised learning. In particular, current work focuses mainly on interpretability in the modeling stage and neglects the interpretability problems arising in the other stages of data mining. To address these problems, this dissertation, building on existing work, systematically studies the interpretability problems involved throughout the data mining process. The main contributions are as follows:

(1) A research framework for interpretability based on the data mining process is proposed. Since there is no universally accepted definition of interpretability in the field, this dissertation first analyzes the definition and connotation of interpretability. From the perspective of the data mining process, it then proposes an interpretability framework based on CRISP-DM. The framework fully considers the influence of the different phases of data mining on interpretability and introduces an "interpretable plane" to partition the interpretability issues in the most important phase, the modeling phase.

(2) An original-data understanding process is proposed. The goal of interpretability research in the data understanding phase is to use appropriate methods to improve understanding of the raw data. The process proposed in this thesis covers both supervised and unsupervised learning and makes full use of visualization techniques, so that users can understand the data quickly and intuitively and move on to subsequent work as soon as possible. For supervised learning, the process considers two dimensions, samples and features, and includes assessing the difficulty of the problem, identifying typical samples, and identifying important features. For unsupervised learning, the process involves assessing the difficulty of the problem and exploring the data.

(3) A high-dimensional feature selection framework is proposed. Because feature selection on high-dimensional data strongly affects the interpretability of the final model, a high-dimensional feature selection framework is proposed to improve the interpretability of the data set to be modeled. The framework is suited to sparse high-dimensional data and integrates sparse-column removal, a filter method, and a wrapper method. SFS, a feature selection algorithm often used in practice, is improved, and the improved SFS algorithm is applied within the framework. Experimental results show that the proposed framework is effective.

(4) An interpretation scheme for black-box models based on human category learning theory is proposed. Since interpretability is related to human cognitive ability, the scheme integrates prototype theory, exemplar theory, and selective attention theory from cognitive psychology. It comprises prototype explanations and exemplar explanations: when interpreting a sample, prototypes are used first, and if they do not work well, exemplars are used. Experiments show that the scheme can effectively interpret black-box models.

(5) A clustering method based on user satisfaction is proposed. Because there are relatively few interpretability studies on unsupervised learning, and clustering is the representative unsupervised task, the interpretability of clustering is studied in this dissertation. As there is no universal index for evaluating a clustering, user satisfaction, which incorporates interpretability, is proposed as an evaluation index. Based on this index, a clustering method driven by user satisfaction is proposed, and experiments show that it effectively improves the interpretability of clustering.

In summary, this study of interpretability across the data mining process addresses, to some extent, the shortcomings of current research, and its results provide valuable clues for interpretability research in the field of data mining.
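The three-stage feature selection framework of contribution (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's actual algorithm: "SFS" is read here as sequential forward selection, the sparsity threshold and the variance filter are hypothetical stand-ins for whatever filter the thesis uses, and the wrapper's `score` function is a placeholder for a real model-quality measure.

```python
# Hypothetical sketch of a three-stage high-dimensional feature selection
# pipeline: (1) drop nearly-empty sparse columns, (2) apply a simple
# variance filter, (3) run a sequential-forward-selection (SFS) wrapper.

def remove_sparse_columns(X, min_nonzero_ratio=0.05):
    """Stage 1: drop columns that are zero in almost every sample."""
    n = len(X)
    keep = []
    for j in range(len(X[0])):
        nonzero = sum(1 for row in X if row[j] != 0)
        if nonzero / n >= min_nonzero_ratio:
            keep.append(j)
    return keep

def variance_filter(X, cols, top_k):
    """Stage 2: keep the top_k columns by variance (a simple filter)."""
    def var(j):
        vals = [row[j] for row in X]
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / len(vals)
    return sorted(cols, key=var, reverse=True)[:top_k]

def sfs(X, cols, score, max_features):
    """Stage 3: greedy sequential forward selection driven by `score`,
    a user-supplied function mapping a feature subset to model quality."""
    selected = []
    while len(selected) < max_features:
        best, best_s = None, float("-inf")
        for j in cols:
            if j in selected:
                continue
            s = score(selected + [j])
            if s > best_s:
                best, best_s = j, s
        if best is None or (selected and best_s <= score(selected)):
            break  # no remaining candidate improves the current subset
        selected.append(best)
    return selected
```

In practice the `score` callback would wrap cross-validated accuracy of the downstream model; the greedy loop stops as soon as adding a feature no longer improves it, which keeps the selected subset small and hence more interpretable.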
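The prototype-first, exemplar-fallback idea of contribution (4) can be sketched in a few lines. This is a hedged illustration, not the dissertation's actual scheme: the prototype is taken to be the class mean, the distance threshold and the number of exemplars `k` are hypothetical parameters, and selective attention (feature weighting) is omitted for brevity.

```python
# Hypothetical sketch of a two-level explanation: cite the nearest class
# prototype if it is close enough, otherwise fall back to the k nearest
# training exemplars.
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_prototypes(X, y):
    """One prototype per class: the mean of that class's samples."""
    protos = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        protos[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return protos

def explain(sample, protos, X, y, threshold=1.0, k=2):
    """Prototype-first explanation: if the nearest prototype lies within
    `threshold`, cite it; otherwise cite the k nearest exemplars."""
    label, proto = min(protos.items(), key=lambda kv: _dist(sample, kv[1]))
    if _dist(sample, proto) <= threshold:
        return ("prototype", label, proto)
    nearest = sorted(zip(X, y), key=lambda xy: _dist(sample, xy[0]))[:k]
    return ("exemplar", [l for _, l in nearest], [x for x, _ in nearest])
```

A sample near a class center is explained by a single prototype ("it looks like a typical member of class c"), while an atypical sample near a decision boundary is explained by concrete neighboring exemplars, mirroring how prototype and exemplar theories divide the work in human category learning.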
Keywords: Data Mining, Machine Learning, Interpretability, Cognitive Psychology, Clustering