
Research On Model Selection For Machine Learning

Posted on: 2012-02-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y F Dong
Full Text: PDF
GTID: 1118330368978865
Subject: Computer software and theory

Abstract/Summary:
Machine learning classification algorithms all carry an inductive bias. Model selection, an important part of machine learning, aims to choose an optimal model, i.e. the proper bias, after building multiple models from the given training data. The optimal model generally has lower complexity and is less prone to overfitting or underfitting. Common model selection methods include data-reuse techniques, analytical methods, heuristic methods, performance metrics, and model averaging. Model averaging in machine learning, i.e. ensemble learning, selects and combines several models to build a new model with better performance.

Building on an in-depth exploration of model selection strategies for machine learning, this thesis carries out the following work:

(1) The gROC curve method, based on discernible granularity, is proposed for comparing classifier performance. The ROC curve is an important visual model selection method. In practice, however, only the empirical ROC curve can be obtained, rather than the true ROC curve, because complete data are unavailable, and the uncertainty of the empirical ROC curve affects the correctness of model selection. The concepts of gROC and gAUC are therefore put forward, based on an analysis of the discernible granularity of the scoring sequence, and their properties are discussed theoretically. Unlike the ROC curve, which uses only ranks, these concepts use both ranks and scores, taking full account of the scores' uncertainty. After a calculation method is given, the rationality of gROC is verified under the binormal model. Using gROC to estimate the ROC curve avoids large-scale sampling and is more efficient than other confidence-band methods. On this basis, two model selection metrics, λAUC and ρAUC, are proposed.
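As background for these metrics, the ordinary empirical ROC curve and its AUC, which gROC generalizes by modeling score granularity, can be computed as in the following minimal sketch (the function names are illustrative, and ties among scores are ignored for simplicity):

```python
# Minimal sketch: empirical ROC curve and trapezoidal AUC from classifier
# scores. Illustrative only; the gROC/gAUC construction in the thesis
# additionally models the discernible granularity of the score sequence.

def empirical_roc(scores, labels):
    """Return (FPrate, TPrate) points of the empirical ROC curve.

    scores: classifier scores, higher means more likely positive.
    labels: 1 for positive, 0 for negative. Scores assumed distinct.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    P = sum(labels)              # number of positives
    N = len(labels) - P          # number of negatives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in pairs:           # sweep the threshold downward
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

roc = empirical_roc([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
print(auc(roc))  # ≈ 0.8889
```

With distinct scores, this trapezoidal AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one.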
The similarity ratio λ reflects the inherent uncertainty of the empirical ROC curve. λAUC integrates the AUC value with the similarity ratio λ and serves as a model selection metric under a given discernible granularity. ρAUC likewise accounts for both the AUC value and the uncertainty of the ROC curve, but ρ is the average measure over all discernible granularities. Experimental results show that ρAUC is more effective than λAUC on large-scale samples. Overall, gROC effectively reflects the uncertainty of the ROC curve, model selection based on λAUC and ρAUC outperforms that based on AUC or sAUC, and in some cases gROC has a stronger capability for comparing classifier performance.

(2) Isometrics analysis and rank correlation analysis are used to compare classifier performance measures intuitively and clearly, and to demonstrate the effectiveness of wAUC for imbalanced data. A shortcoming of the traditional AUC is that it does not consider cost bias: it applies the same weight (an assumed uniform cost of 1) to every region in the calculation. For two-class imbalanced data, wAUC lets the weights vary with the true positive rate (TPrate), so as to emphasize the minority class, which is usually the more important one. wAUC is compared with other common performance evaluation metrics through isometrics analysis. The properties of wAUC with linear and exponential weight functions are explored in simulation experiments, and the two families of isometrics are shown to be non-linearly parallel. In addition, the case where the isometrics of wAUC and AUC intersect under the exponential weight function is analyzed. The isometrics experiments show that wAUC can distinguish classifiers with identical AUC values and better evaluates classifiers on imbalanced data.
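A minimal sketch of the idea behind a TPrate-weighted AUC follows; the weighting scheme and the particular linear and exponential weight functions are illustrative assumptions, not the thesis's exact definition of wAUC. Each slice of the area under the curve is weighted by a function of the true positive rate, and with a constant weight the measure reduces to the ordinary AUC:

```python
import math

# Hypothetical sketch of a TPrate-weighted AUC in the spirit of wAUC.
# Each trapezoidal slice of the area under the ROC curve is weighted by
# w(TPrate) and the result is normalized, so that w == constant recovers
# the ordinary AUC. The weight functions below are illustrative choices.

def weighted_auc(points, w):
    """points: ROC curve as (FPrate, TPrate) pairs from (0,0) to (1,1);
    w: weight as a function of TPrate."""
    num = den = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        ybar = (y0 + y1) / 2.0           # mean TPrate of this slice
        num += w(ybar) * (x1 - x0) * ybar
        den += w(ybar) * (x1 - x0)
    return num / den

linear_w = lambda t: 1.0 + 2.0 * t       # linear weight (slope assumed)
exp_w = lambda t: math.exp(2.0 * t)      # exponential weight (rate assumed)
```

Because the increasing weights put more mass on the high-TPrate region, two classifiers with equal AUC but different behavior on the minority class receive different weighted scores.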
The rank correlation analysis confirms that wAUC correlates more strongly with TPrate than other measures do, and is thus well suited to learning from imbalanced data.

(3) An improved AdaBoost algorithm, IL-AdaBoost, is proposed for XML document classification by means of ensemble learning. An XML document integrates text content with structural information, and XML document classification is a research branch of XML data mining. Most existing work deals with static XML data, yet in practical applications XML data is often dynamic, so data mining algorithms that can reflect the variability of XML data are urgently needed. IL-AdaBoost is proposed after discussing whether ensemble learning can be applied to XML classification. Based on the dynamic character of XML documents, a method for building the feature space is proposed that applies the H-Dom model to mine frequently changing substructures. The method represents the sample space by this feature space, runs IL-AdaBoost on it, and thus yields an ensemble incremental learning algorithm for XML data classification. It uses frequently changing XML substructures as features to build decision stumps, which serve as the weak classifiers of the boosting algorithm. It simulates the arrival of new XML documents with a Poisson process to reflect their time-varying character, updates the sample distribution to achieve incremental learning, and increases the diversity of the base classifiers by sampling to enhance the performance of the ensemble.

In conclusion, this thesis addresses model selection for machine learning from the perspectives of classifier performance metrics, ensemble learning, and related aspects. First, gROC and gAUC are proposed, and two gROC-based performance evaluation metrics, λAUC and ρAUC, are designed.
Second, experimental results on UCI data sets demonstrate the advantages of these two metrics, and the properties of the wAUC performance evaluation metric are analyzed through isometrics analysis and rank correlation analysis. Finally, the IL-AdaBoost algorithm for XML document classification is proposed. Although these results have been obtained, several directions remain for further study, for example building classifiers by directly optimizing wAUC and refining a general approach to model selection.
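As background for IL-AdaBoost, the standard AdaBoost procedure with decision stumps that it builds on can be sketched as follows. Here each 0/1 feature stands in for the presence of a frequently changing substructure in a document, and the incremental, Poisson-based resampling machinery of IL-AdaBoost is omitted; the helper names are illustrative:

```python
import math

# Minimal AdaBoost with decision stumps over binary (0/1) features.
# A stump tests one feature and predicts its polarity; each round picks
# the stump with lowest weighted error and re-weights the samples so
# misclassified ones gain influence.

def adaboost_stumps(X, y, rounds):
    """X: list of 0/1 feature vectors; y: labels in {-1, +1}."""
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n
    ensemble = []                        # (alpha, feature index, polarity)
    for _ in range(rounds):
        best = None
        for j in range(d):
            for pol in (1, -1):
                # stump predicts pol if feature j is present, else -pol
                err = sum(wi for wi, x, yi in zip(w, X, y)
                          if (pol if x[j] else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, pol)
        err, j, pol = best
        err = max(err, 1e-12)            # avoid log(0) for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, pol))
        # re-weight: correctly classified samples shrink, errors grow
        w = [wi * math.exp(-alpha * yi * (pol if x[j] else -pol))
             for wi, x, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (pol if x[j] else -pol) for a, j, pol in ensemble)
    return 1 if score >= 0 else -1
```

IL-AdaBoost replaces the fixed sample set with a stream of simulated XML documents and updates the distribution incrementally, but the per-round stump selection and re-weighting follow this same pattern.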
Keywords/Search Tags: Data Mining, Machine Learning, Classification Algorithm, Model Evaluation, Model Selection, ROC Curve, Ensemble Learning