
Research On Optimization Method Of Automatic Machine Learning For Small Sample Numerical Tabular Data

Posted on: 2022-12-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Li
Full Text: PDF
GTID: 1488306605975239
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the industrial revolution, the new generation of information technology represented by machine learning has played an important role in key national strategic areas. However, applying machine learning in these areas, such as materials science, physics and chemistry, biomedicine, national defense and manufacturing, often means learning from small sample numerical tabular data. Successful applications also depend heavily on domain experts, so the threshold for applying machine learning is high and successful applications are difficult to extend to other areas. Although existing automatic machine learning technology reduces the difficulty of applying machine learning in various fields to some extent, it does not enhance small sample numerical tabular data at the data level, and generalization performance on such data remains insufficient. In addition, the risk of overfitting remains high when a model is evaluated only by its modeling performance on the small sample itself, and some optimization methods used in automatic machine learning require too many initialization parameters to be predefined. To improve the performance of automatic machine learning modeling for small sample numerical tabular data, this thesis systematically studies automatic data enhancement, automatic feature construction, parameter optimization methods and model performance evaluation. The main research contents and innovations are as follows:

1) Automatic lasting data enhancement method based on generalization error bounds: To address the automatic generation and screening of lasting data for small sample numerical tabular data, generalization error bounds based on the Rademacher complexity are studied and adopted to avoid introducing noise samples to some extent. Combined with a data generation method based on domain extension, this achieves effective lasting data enhancement for small sample numerical tabular data.

2) Automatic data feature construction method based on the consistency of evaluation indexes: Based on data distribution distance theory, the main factors that affect data distribution performance are studied and the non-overlap degree index is designed. Experiments show that the non-overlap degree is highly consistent with the prediction accuracy of the machine learning model. Based on these consistent indexes, the automatic feature construction methods GP-ANO and GP-AANO are proposed for balanced and imbalanced data respectively. (1) For balanced data, the non-overlap degree is introduced into an automatic feature construction method based on genetic programming. The experimental results demonstrate that automatic feature construction with the non-overlap degree achieves better generalization performance and data distribution. (2) For imbalanced data, the problem with the non-overlap degree is analyzed and the augmented non-overlap degree is proposed. The augmented non-overlap degree is combined with the AUC index to improve the generalization performance of automatic feature construction for imbalanced data.

3) Efficient optimization method based on the teaching-learning based optimization algorithm: The teaching-learning based optimization algorithm has the advantage of needing fewer initial parameters. However, it has a search bias toward the origin. The search bias is first analyzed, and adaptive learning factors are then introduced to eliminate it. At the same time, random self-learning and mutation stages are introduced to increase the diversity of the population, preventing the search from becoming trapped in local optima and improving the optimization performance.

4) Automatic machine learning framework for small sample data: An automatic machine learning framework for small sample numerical data is proposed based on the fusion of automatic data enhancement, automatic feature construction, automatic algorithm parameter optimization and ensemble learning. The error of ensemble learning and the overlap degree are analyzed in depth, and an evaluation index based on the ensemble learning error and the improved augmented non-overlap degree is designed. The proposed framework is validated on practical data modeling problems in the materials and biomedical fields.

Therefore, the methods studied in this thesis can be effectively applied to the small sample numerical tabular data learning problem, and the optimal model can be obtained automatically. Overall, the research and the proposed methods in this thesis have important theoretical and practical value for small sample learning and automatic machine learning.
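For readers unfamiliar with the teaching-learning based optimization (TLBO) algorithm mentioned in contribution 3), the following is a minimal sketch of the basic, unmodified algorithm as it is commonly described in the literature. It is not the improved variant proposed in the thesis: the adaptive learning factors, random self-learning stage and mutation stage are not specified in this abstract and are therefore omitted. The function names `tlbo` and `sphere`, the box-constraint handling and the toy objective are assumptions made purely for illustration.

```python
import random


def sphere(x):
    """Toy objective (illustrative assumption): sum of squares, minimum at the origin."""
    return sum(v * v for v in x)


def tlbo(objective, bounds, pop_size=20, iterations=100, seed=0):
    """Minimize `objective` inside the box `bounds` with basic TLBO.

    Apart from the random seed, only the population size and the number
    of iterations have to be chosen in advance.
    """
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(x):
        # Keep a candidate inside the search box (a common, assumed choice).
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [objective(x) for x in pop]

    for _ in range(iterations):
        # Teacher phase: move each learner toward the current best solution
        # and away from the population mean scaled by a teaching factor.
        teacher = pop[min(range(pop_size), key=fit.__getitem__)]
        mean = [sum(x[d] for x in pop) / pop_size for d in range(dim)]
        for i in range(pop_size):
            tf = rng.choice((1, 2))  # teaching factor, 1 or 2 at random
            cand = clip([pop[i][d] + rng.random() * (teacher[d] - tf * mean[d])
                         for d in range(dim)])
            f = objective(cand)
            if f < fit[i]:  # greedy acceptance
                pop[i], fit[i] = cand, f

        # Learner phase: each learner interacts with one random peer,
        # moving toward the peer if it is better and away from it otherwise.
        for i in range(pop_size):
            j = rng.choice([k for k in range(pop_size) if k != i])
            sign = 1.0 if fit[i] < fit[j] else -1.0
            cand = clip([pop[i][d] + sign * rng.random() * (pop[i][d] - pop[j][d])
                         for d in range(dim)])
            f = objective(cand)
            if f < fit[i]:
                pop[i], fit[i] = cand, f

    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]


if __name__ == "__main__":
    best_x, best_f = tlbo(sphere, [(-5.0, 5.0)] * 3)
    print(best_x, best_f)
```

In the teacher phase, the term `teacher[d] - tf * mean[d]` with `tf = 2` subtracts twice the population mean. This is one place where a pull toward the origin of the kind noted in the abstract can arise, so a toy objective whose optimum lies at the origin, such as `sphere`, tends to flatter the basic algorithm.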
Keywords/Search Tags: Small sample learning, Numerical tabular data, Automatic machine learning, Evolutionary optimization, Performance evaluation