| Causality is a ubiquitous relationship in the real world and one of the most important areas of scientific research. In scientific research, causality offers a better explanation than correlation and provides decision makers with more accurate knowledge. Causality has recently emerged as a new research paradigm in data mining, and a number of classical causal inference models have been proposed. Among them, the Additive Noise Model (ANM), a one-to-one causal inference model, has been shown to infer causality in both continuous and discrete simulated data with high accuracy (greater than 90%). However, ANM still has room for improvement, especially for high-dimensional discrete data. Current research shows that ANM has two major technical bottlenecks: (1) ANM cannot infer causality in a many-to-one causal structure; (2) ANM cannot distinguish indirect causality. This thesis presents a three-part research process that improves ANM to address these two problems. The three parts are as follows: (1) designing and implementing a generalized causal inference model based on ANM to handle many-to-one causal structures; (2) improving and implementing the generalized causal inference model for low-dimensional discrete data; (3) optimizing the generalized causal inference model with a heuristic algorithm so that it can be applied to high-dimensional discrete data. These three steps proceed from easy to difficult, presenting an improvement strategy that lets the generalized causal inference model cope with the increasing dimensionality of the target data, and the model is applied in several big data research areas. The main content and results of this thesis are as follows. 1. 
To address the problem that ANM cannot infer many-to-one causal structures, this thesis presents a generalized causal inference model based on ANM (Additive Noise Model for Multiple-causes Discovery, ANMMcD) to fill the gap in causality research on discrete data; ANMMcD is first applied to causal inference in low-dimensional discrete data. Many-to-one causal structures are very common in real discrete data, and ANMMcD combines multivariate statistics with ANM to infer them. The model can accurately infer many-to-one causal structures in discrete data, which overcomes the limitation that ANM can infer only one-to-one causal structures in discrete data. ANMMcD therefore provides a theoretical basis for causal inference research on discrete data. In application, ANMMcD performs well on software project risk analysis data (containing 27 risk factors): it can infer the direct risk factors of the final project outcome, and it performs better than an existing v-structure causal inference model (Bayesian Network with Causality Constraints, BNCC) and other classical feature selection algorithms. 2. Structure learning is combined with ANMMcD to form a novel generalized causal inference model (Multi-causes Discovery with Structure Learning, McDSL), which is applied to low-dimensional discrete data. Indirect factors in high-dimensional discrete data are a major problem for causal inference models, so this study combines a structure learning method with the generalized causal inference model to construct a two-step model for generalized causal inference. McDSL identifies the potential cause and effect factors of the target factor in order to distinguish indirect causes, thereby dealing with interfering causality and the additional complexity. 
On low-dimensional discrete data, the McDSL model is more accurate than the ANMMcD model, and it provides a viable option for causal inference in high-dimensional discrete data. In the application phase, McDSL was applied to risk factor analysis of stock returns. The stock return data contain 50 risk factors, more than the software project risk analysis data, and exhibit more complex causality. Experimental results show that McDSL can discover market investment rules, and that it performs better in return prediction than several feature selection algorithms from published studies. 3. A heuristic algorithm is employed to optimize the structure learning method in McDSL, and this study presents a heuristic generalized causal inference model (Heuristic Multi-causes Discovery, HMcD) for high-dimensional discrete data. HMcD reduces the computational complexity of the generalized causal inference process. The model combines the global search capability and fast convergence of the genetic algorithm. HMcD can infer more accurate information than McDSL with lower complexity on high-dimensional discrete data with missing samples. The model is used to infer causality in adverse drug reaction data, which contain 1385 biological risk factors and 888 chemical structures of drugs. Experimental results show that HMcD infers causality more accurately than an existing structure-learning-based causality analysis model (Causality Analysis model based on Structure Learning, CASTLE). In summary, we propose and improve generalized causal inference models based on additive noise models. Theoretical analysis and simulation results show the performance of each generalized causal inference model in different cases. Moreover, our study can be used to address causal inference problems in several real-world data mining areas. 
Therefore, the generalized causal inference model is an effective algorithm for discrete data, and it has great reference value for further research. |
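
As a minimal illustration of the one-to-one ANM idea that the models above generalize, the following Python sketch tests both directions of a discrete additive noise model and prefers the direction whose residuals look independent of the putative cause. It is not the ANMMcD, McDSL, or HMcD procedure from this thesis: the function names, the mode-based regression, the mutual-information score, and the synthetic data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def fit_mode_regression(cause, effect):
    """Map each cause value to the effect value most often observed with it.

    Assumption: this mode-based fit is a simple stand-in for the regression
    step of a discrete ANM, not the procedure used in the thesis.
    """
    buckets = defaultdict(Counter)
    for c, e in zip(cause, effect):
        buckets[c][e] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in buckets.items()}


def mutual_information(a, b):
    """Empirical mutual information (in nats) between two discrete samples."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[u] * count_b[v]))
        for (u, v), n_ab in joint.items()
    )


def anm_score(cause, effect):
    """Dependence between the cause and the residual effect - f(cause).

    A score near zero means the additive noise model fits in this direction.
    """
    f = fit_mode_regression(cause, effect)
    residuals = [e - f[c] for c, e in zip(cause, effect)]
    return mutual_information(cause, residuals)


def infer_direction(x, y):
    """Prefer the direction whose residuals are less dependent on the cause."""
    forward = anm_score(x, y)   # hypothesis X -> Y
    backward = anm_score(y, x)  # hypothesis Y -> X
    label = "X->Y" if forward < backward else "Y->X"
    return label, forward, backward


# Synthetic data (hypothetical): Y = X^2 + N, with N drawn independently of X.
rng = random.Random(0)
x = [rng.randrange(4) for _ in range(5000)]
noise = rng.choices([-1, 0, 1], weights=[0.2, 0.6, 0.2], k=5000)
y = [xi * xi + ni for xi, ni in zip(x, noise)]

direction, forward, backward = infer_direction(x, y)
print(direction, round(forward, 4), round(backward, 4))
```

On this synthetic sample the forward score should be close to zero, whereas the backward residuals remain clearly dependent on Y, so the sketch selects X->Y.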