This paper clarifies the causes of the discrepancies between traditional data mining modeling results and the actual situation. It corrects the traditional findings and applies the corrected results to the control modeling of complex industrial processes.

With the rapid development of information technology, data processing technology, and computer technology, data mining has been widely adopted in many fields. Practitioners in industrial manufacturing have a strong interest in data mining, hoping to use it to model and continuously optimize manufacturing processes. However, the characteristics of industrial data, including high dimensionality, high noise, and non-independence, make it difficult to find the causal relationships that reflect the nature of the data through straightforward use of traditional data mining techniques. This in turn hinders the discovery of control models that help improve industrial processes. Causal inference is a data analysis theory that has developed considerably in recent years. It combines data mining with accumulated theoretical knowledge to obtain reliable causal models through mutual verification and inspiration. Such causal models can be used to adjust industrial processes, improving manufacturing efficiency and product quality and reducing energy consumption.

The paper investigates in depth the contradictory results obtained from statistical modeling and mechanism modeling, which arise from the complex characteristics of industrial data. The findings are as follows. First, errors in the independent variables bias the estimates obtained by traditional least squares modeling; adjusted least squares models, derived from an analysis of the cause of this bias, can reflect the nature of the data under specific conditions and can be used to optimize industrial processes.
Second, errors in the independent variables can invalidate traditional model evaluation methods. The deviation between the mean of the predicted values and the mean of the measured values cannot be used as a simple index of model reliability. The paper predicts this deviation through formula derivation, based on a thorough discussion of the cause of the problem. Third, a theoretically linear relationship may appear non-linear, and a theoretically non-linear relationship may appear linear, owing to the complex inherent relationships among variables. The correlation coefficient alone cannot be used to judge the correlation among variables. Fourth, simply eliminating the data that lie at the edges of the data distribution during preprocessing changes the distribution and thus biases the least squares estimates. Models of this type cannot be used to optimize industrial processes. Fifth, a change in the distribution of the dependent variable can create the illusion that the best estimate of a linear relationship is non-linear. The best fit to the data cannot be used as a criterion for choosing between linear and non-linear models. All formulas derived in the paper have been verified by simulation.

The paper uses the SAS system to support the data mining research. Three main applications are demonstrated: the adjusted calculation of parameter estimates for two-dimensional linear models based on least squares modeling; the prediction of the deviation between the means of predicted and measured values; and the prediction of the parameter deviation caused by complex correlations among the independent variables.

The paper illustrates a variety of causal factors that lead to contradictory results between data mining modeling and theoretical modeling. Some of the data mining modeling results have been adjusted.
The research findings will help obtain reliable industrial control models through causal inference, thereby providing necessary support for the optimization of industrial processes.
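
As an illustration of the first and fourth findings, the following is a minimal simulation sketch, not the paper's own SAS code or its derived formulas. It assumes a single-variable linear model with Gaussian data, a measurement-error variance in the independent variable that is known in advance, and edge trimming applied to the dependent variable. All function names, parameter values, and the textbook reliability-ratio correction used here are assumptions chosen for illustration.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y regressed on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def errors_in_variables_demo(n=20000, true_slope=2.0, x_err_sd=0.5, seed=0):
    """Return (naive, corrected) slopes when x is observed with error.

    Assumption: the error variance of x (x_err_sd ** 2) is known.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [true_slope * x + rng.gauss(0.0, 0.3) for x in x_true]
    x_obs = [x + rng.gauss(0.0, x_err_sd) for x in x_true]

    # Naive least squares on the noisy x is attenuated toward zero.
    naive = ols_slope(x_obs, y)

    # Reliability ratio: estimated var(x_true) / var(x_obs).
    var_obs = sample_variance(x_obs)
    reliability = (var_obs - x_err_sd ** 2) / var_obs
    corrected = naive / reliability
    return naive, corrected


def edge_trimming_demo(n=20000, true_slope=2.0, y_limit=2.0, seed=1):
    """Return (full-sample slope, slope after dropping points with |y| >= y_limit).

    Assumption: "edge" data are removed based on the dependent variable.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [true_slope * a + rng.gauss(0.0, 1.0) for a in x]
    kept = [(a, b) for a, b in zip(x, y) if abs(b) < y_limit]
    x_kept = [a for a, _ in kept]
    y_kept = [b for _, b in kept]
    return ols_slope(x, y), ols_slope(x_kept, y_kept)


if __name__ == "__main__":
    naive, corrected = errors_in_variables_demo()
    print(f"errors in x: naive slope {naive:.3f}, corrected slope {corrected:.3f}")
    full, trimmed = edge_trimming_demo()
    print(f"edge trimming: full slope {full:.3f}, trimmed slope {trimmed:.3f}")
```

With these assumed settings, the naive slope should fall well below the true value of 2.0 (the large-sample limit is 2.0 × 1/(1 + 0.25) = 1.6), while the corrected slope should return close to 2.0. The trimmed-sample slope should also be noticeably smaller than the full-sample slope, which shows how removing edge data can bias a least squares fit.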