The PAC-Bayes risk bound, which integrates the Bayesian paradigm with structural risk minimization for stochastic classifiers, provides a framework for analyzing machine learning algorithms and yields some of the tightest known generalization bounds. The validity of PAC-Bayes theory follows from the Probably Approximately Correct (PAC) model and Bayesian decision theory. The PAC-Bayes bound is an important statistical measure of the generalization performance of machine learning algorithms; it has a rigorous mathematical formulation and applies generally. This thesis applies the PAC-Bayes risk bound to assess the generalization performance of support vector machines (SVMs). First, open tests and closed tests are constructed on five UCI data sets, and the PAC-Bayes bound is calculated together with related statistical measures, including sensitivity, specificity, and accuracy. Analysis of the covariance and correlation coefficients between the PAC-Bayes bound and these measures shows that the bound has a strong negative correlation with accuracy and some negative correlation with specificity and sensitivity. Second, as a method for assessing model performance, the PAC-Bayes bound is compared with N-fold cross-validation. The two give highly consistent results, indicating that the PAC-Bayes bound closely reflects the generalization risk. Furthermore, the PAC-Bayes bound is applied to SVM model selection in order to choose the best penalty and kernel parameters rapidly. Finally, the SVM and the PAC-Bayes bound are applied to protein structure prediction. A major issue in the practical use of the PAC-Bayes bound is the estimation of the unknown prior and posterior distributions over the concept space.
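To make the bound discussed above concrete, the following is a minimal sketch of how a PAC-Bayes bound can be evaluated numerically once the Kullback-Leibler divergence between posterior and prior is known. It assumes the widely used Langford-Seeger form, kl(empirical risk || true risk) <= (KL(Q||P) + ln((m+1)/delta)) / m, which may differ from the exact form used in this thesis. The function names `binary_kl`, `kl_inverse_upper`, and `pac_bayes_bound` are illustrative, not taken from the thesis.

```python
import math


def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total


def kl_inverse_upper(emp_risk, budget, tol=1e-10):
    """Largest p >= emp_risk with binary_kl(emp_risk, p) <= budget.

    binary_kl(q, p) is increasing in p for p >= q, so bisection applies.
    """
    lo, hi = emp_risk, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(emp_risk, mid) > budget:
            hi = mid
        else:
            lo = mid
    return lo


def pac_bayes_bound(emp_risk, kl_qp, m, delta=0.05):
    """Upper bound on the true risk of the stochastic classifier.

    Assumed form (Langford-Seeger): holds with probability >= 1 - delta
    over a sample of size m, given KL(Q || P) = kl_qp.
    """
    budget = (kl_qp + math.log((m + 1) / delta)) / m
    return kl_inverse_upper(emp_risk, budget)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the call.
    print(pac_bayes_bound(emp_risk=0.10, kl_qp=5.0, m=1000, delta=0.05))
```

Under this form the bound tightens as the sample size m grows and loosens as KL(Q||P) grows, which is why estimating that divergence is central to computing the bound.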
In this thesis, by formulating the concept space as a reproducing kernel Hilbert space (RKHS) via the kernel method, we propose a random sampling method and a Markov chain Monte Carlo (MCMC) sampling method for simulating draws from the posterior distribution over the concept space, which makes it possible to compute the Kullback-Leibler divergence and the PAC-Bayes bound. We also propose a variance minimization method to investigate the statistical significance of the support vectors and to optimize the support vectors and their weight vectors. Experimental results on two artificial data sets show that the simulation method is reasonable and effective in practice. Building on the RKHS formulation of the concept space, we further propose a refined MCMC sampling algorithm that incorporates feedback information from the model when simulating draws from the posterior distribution. We then use kernel density estimation to estimate the probability density of the posterior distribution, compute the Kullback-Leibler divergence between the posterior and prior distributions, and thereby solve the problem of computing the PAC-Bayes bound. Finally, we compare the random sampling, MCMC, and refined MCMC methods; the experimental results show that the refined method improves the calculation of the PAC-Bayes bound.
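The sampling-plus-density-estimation pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's algorithm: it uses a plain random-walk Metropolis sampler rather than the refined MCMC method, a one-dimensional Gaussian stands in for the posterior over the concept space, and the prior is a standard normal. The names `metropolis`, `kl_from_samples`, `log_prior`, and `log_post` are hypothetical. The Gaussian toy case is chosen because its KL divergence is known in closed form, so the estimate can be checked.

```python
import numpy as np
from scipy.stats import gaussian_kde


def metropolis(log_target, x0, n_steps, step, rng):
    """Random-walk Metropolis sampler for an unnormalized log density."""
    x = np.asarray(x0, dtype=float)
    logp = log_target(x)
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        proposal = x + step * rng.standard_normal(x.size)
        logp_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if np.log(rng.random()) < logp_new - logp:
            x, logp = proposal, logp_new
        chain[t] = x
    return chain


def kl_from_samples(samples, log_prior):
    """Estimate KL(Q || P) = E_Q[log q - log p], with q replaced by a KDE."""
    kde = gaussian_kde(samples.T)  # gaussian_kde expects shape (dim, n)
    log_q = kde.logpdf(samples.T)
    log_p = np.array([log_prior(s) for s in samples])
    return float(np.mean(log_q - log_p))


def log_prior(w):
    """Normalized log density of the assumed prior N(0, I)."""
    return float(-0.5 * w @ w - 0.5 * w.size * np.log(2 * np.pi))


def log_post(w):
    """Unnormalized log density of the toy posterior N(1, 0.5^2)."""
    return float(-0.5 * np.sum(((w - 1.0) / 0.5) ** 2))


rng = np.random.default_rng(0)
chain = metropolis(log_post, x0=[0.0], n_steps=20000, step=1.0, rng=rng)
samples = chain[2000::10]  # drop burn-in, then thin the chain

kl_est = kl_from_samples(samples, log_prior)
# Closed form for KL(N(1, 0.5^2) || N(0, 1)).
kl_exact = np.log(2.0) + (0.25 + 1.0) / 2.0 - 0.5
print(f"KDE estimate: {kl_est:.3f}, closed form: {kl_exact:.3f}")
```

The estimate evaluates the kernel density at the same samples used to fit it, which introduces some bias; a held-out split of the chain would be a reasonable alternative. The resulting KL value is the quantity that a PAC-Bayes bound takes as input.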