
Optimization Modeling Methods Based On Analyzing Statistical Character Of Errors

Posted on: 2019-02-17  Degree: Doctor  Type: Dissertation
Country: China  Candidate: G X Jiang  Full Text: PDF
GTID: 1368330551458766  Subject: Computer application technology
Abstract/Summary:
The accumulation and rapid growth of data of all types pose great challenges to data analysis. As the core technique of data analysis, machine learning extracts rules or knowledge from data and provides decision support under unknown circumstances. We hope that machine learning models match the data well; however, the adaptability of any model falls far short of the complexity of the data, and no model can be guaranteed to match every kind of data well. This mismatch is usually expressed as error. How to make rational, effective and full use of errors has been studied in machine learning for decades, and many classical error-based learning algorithms and techniques have been proposed, such as error back-propagation, AdaBoost and self-paced learning. These works have advanced machine learning and revealed the value hidden in errors. Research on error learning is a topic common to many kinds of learning problems; it remains a lively, open and promising direction.

Statistics is one of the major tools for studying data of any substantial size, and the problems of big data analysis highlight its importance. It is therefore natural to study the characteristics of errors with statistical methods and thereby advance data analysis. This thesis focuses on three kinds of errors in machine learning, namely training error, test error and shifting error, and conducts systematic, in-depth research on improving data quality and optimizing models and parameters. At the data level, correlation analysis and distance measures for time-warped series are given, and a noise-filtering algorithm suitable for both classification and regression is proposed. At the model level, the theory of cross-validation (CV) error estimation is improved, an accurate, stable and efficient CV approach for a specific type of data is proposed, and a novel hyper-parameter optimization method for well-posed learning problems is presented. The main contributions of
this thesis are summarized as follows:

(1) Correlation analysis, curve registration and distance measurement for time-warped data. Time warping can corrupt correlation and distance measures and thereby interfere with or misguide time series analysis. We first study the statistical characteristics of the correlation coefficient of time-warped series, and then propose methods for identifying pseudo-correlation and determining the true correlation of time series. To eliminate time warping, we propose a curve-registration optimization criterion with broader applicability and solve the resulting optimization problem more efficiently. The proposed maximum shifting correlation distance measures the distance between time series that differ in phase and amplitude. Together, these methods effectively remove the interference of time warping from series data analysis and prepare the data for subsequent analysis.

(2) An elastic noise-filtering system for supervised learning. The preconditions for the effectiveness of existing prediction-based noise filters are supplemented, and the relations among typical filters are explained from the perspective of probability. We prove the low-noise property of errors within a confidence interval and interpret the relationship between noise and error, and then propose the interval-insensitive filter (IIF) algorithm, which rests on loose assumptions and whose effectiveness is verified empirically. These works strengthen the theoretical basis of existing filters; moreover, the algorithm's key concept, the interval-insensitive error, offers a new perspective for related error-learning methods.

(3) The relationship between the accuracy and the stability of error estimation, together with the key factors of CV error estimation, identified theoretically. Indicators for measuring the accuracy and stability of CV estimation are given; their quantitative relationship is proved and provides
theoretical guidance for improving accuracy by reducing variance. The relationships between CV variance and related variables are then derived, forming a strategy for adjusting the CV variance. The classification results theoretically explain several previously empirical observations. We also propose a unified normalized variance for classification and regression that serves as a stable error measure. This part of the work provides theoretical guidance for improving the accuracy and stability of error estimation and is of great significance for model selection.

(4) Markov cross-validation (M-CV), which gives a better estimate of model error for autocorrelated series. The periodicity, overlap and correlation of a series bias the error estimation of time series models. Accordingly, three corresponding CV criteria are given and Markov cross-validation is proposed. Its partitioning method keeps the samples within each subset a certain distance apart, avoiding the over-fitting or loss of series information that leads to under- or overestimation of errors. In addition, M-CV error estimation is proved to have low variance, which ensures its stability. Experiments also show that M-CV achieves lower deviation, variance and time cost than other CV schemes.

(5) The minimum symmetry similarity criterion, based on training errors, for the hyper-parameter optimization problem. Conventional grid-search cross-validation is computationally expensive and subject to randomness. We instead optimize the hyper-parameter from the similarity between training errors: a directional similarity based on training results is proposed to reflect the similarity of generalization errors, and we show that the symmetric directional similarity reaches its minimum at the best parameter. The resulting minimum symmetry similarity criterion (MSSC) applies to hyper-parameter optimization in five kinds of learning tasks. It has lower time
complexity than CV, and its result is unique. The proposed directional similarity builds a bridge between training results and prediction errors, making it possible to study generalization ability from the perspective of training results.

In summary, this thesis analyzes the statistical characteristics of the errors between data and model, and studies how to improve data quality and optimize models and parameters. These results refine and extend the theory and application of error learning, provide effective approaches for complicated data analysis, and have both theoretical significance and application value.
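Contribution (1) concerns distance measures that stay meaningful when series are warped in time. As a generic point of comparison only, not the thesis's maximum shifting correlation distance (whose definition is not given in this abstract), a minimal dynamic time warping (DTW) distance can be sketched as:

```python
def dtw_distance(a, b):
    """Minimal DTW distance between two 1-D numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local mismatch
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# Two series identical up to a phase shift: DTW absorbs the shift,
# whereas a pointwise (Euclidean-style) distance would not.
x = [0, 0, 1, 2, 1, 0, 0]
y = [0, 1, 2, 1, 0, 0, 0]
print(dtw_distance(x, y))  # 0.0
```

Because DTW aligns each point of one series with the nearest-matching stretch of the other, it removes phase distortion but ignores it entirely; the maximum shifting correlation distance described above additionally accounts for amplitude shifts.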
Keywords/Search Tags:Analysis of statistical character of errors, Optimization modeling, Noise filtering, Cross-validation, Model selection, Hyper-parameter optimization