| In recent years, near-infrared (NIR) spectroscopy has been widely used in many fields because it is fast, efficient and accurate. However, the absorption bands in the NIR region are superpositions of the overtone, combination and difference bands of the fundamental absorptions of functional groups with higher energy (C-H, O-H, N-H, etc.) in organic matter. Because the spectra in this region overlap severely and are discontinuous, and because the spectral data are high-dimensional, it is difficult to extract quantitative or qualitative information about a substance directly from its NIR spectrum and to give a reasonable spectral interpretation. Variable (wavelength) selection is a critical step in the multivariate calibration of NIR spectra: it removes irrelevant and redundant variables and reduces both the dimension of the spectral data and the complexity of the algorithm, which improves the prediction performance of the model, makes the calibration model more reliable, and yields a simpler and more reasonable interpretation. Variable (feature) selection therefore plays an important role in spectral data analysis. Although variable selection algorithms have been developed and applied in NIR spectral analysis, many problems remain to be solved, including stability, reliability, interpretability, applicability, modeling methods and computation cost. This paper aims to improve variable selection methods in terms of stability, reliability, interpretability, computation cost and model prediction performance. It also focuses on eliminating the negative effects of noise and interfering variables on the variable selection algorithm and on model performance. Two variable selection methods for NIR spectra are proposed: (1) A new variable selection method, selectivity ratio competitive population analysis (SRCMPA), is proposed. The algorithm adopts the ideas of the selectivity
ratio, adaptive weighted sampling and model population analysis, and combines them with variable ranking and an exponentially decreasing function. A key wavelength is defined as a wavelength with a high score in the regression model. Here, the selectivity ratio (SR) score under the PLS model is used as the index of each wavelength's importance. According to this importance, SRCMPA selects N wavelength subsets sequentially through Monte Carlo sampling, in an iterative and competitive manner. In each sampling run, a PLS model is built on a fixed proportion of the samples and the SR value of each variable is calculated. Using the ranking of the SR scores, with the normalized SR scores as weights, the key variables are selected in two steps: forced selection by the exponentially decreasing function and competitive selection by adaptive weighted sampling. Finally, cross-validation (CV) is used to select the optimal subset, the one with the lowest root mean square error of cross-validation (RMSECV). The algorithm was tested on a wheat protein dataset and a beer dataset and compared with three efficient algorithms. The experimental results show that the algorithm can find the best combination of key wavelength variables in a dataset, that these variables can be used to explain the chemical characteristics of interest, and that the evaluation results after modeling are also the best. (2) This paper develops a variable selection method called significant multivariate competitive population analysis (SCMPA), which combines Monte Carlo sampling (MCS), significant multivariate correlation (sMC), the exponentially decreasing function (EDF) and weighted bootstrap sampling (WBS) as competition mechanisms, together with a variable ranking strategy based on model parameters and the ideas of model population analysis (MPA). SCMPA inherits the core idea of MPA: random sampling and statistical analysis. It performs
statistical analysis (i.e., statistical tests) on the performance of a large population of sub-models generated by random sampling and extracts the information of interest from the outputs of the sub-models. It uses the empirical distribution of the output of interest to assess the importance of variables, which avoids the uncertainty of relying on a single model. In other words, a large number of sub-models are built by Monte Carlo sampling (MCS), and the distribution of the sMC values output by the sub-models is analyzed statistically. sMC combines the regression variance and the residual variance from the PLS regression model to determine a variable's importance statistically, and it discards the orthogonal part of the variance decomposition so that irrelevant information in the dataset does not influence the result; this makes the selected variables stable and reliable. In this study, an F-test is also used to identify variables that are statistically significant with respect to their relationship (regression) to the measured value y. The variables are ranked by their F values against a defined threshold. sMC thus provides the best variable list, with the fewest false negative and false positive errors. The key variables are then selected through two competitive mechanisms, EDF and WBS. First, EDF forcibly eliminates the large number of uninformative or interfering variables spread across the dataset. WBS is then applied to further eliminate variables with small weights, following a "survival of the fittest" principle: variables with larger weights have a greater probability of being retained, whereas those with smaller weights are less competitive and are gradually eliminated from the variable population. Because WBS updates the variable weights step by step, the variable space shrinks softly; this retains the synergistic and combined effects among variables while gradually removing uninformative variables through the shrinkage strategy. Finally, the best variable set is
obtained by extracting the optimal sub-model, the one with the minimum root mean square error of cross-validation (RMSECV), from the pool of sub-models. The method was tested on three NIR spectral datasets and compared against three high-performance variable selection methods. The experimental results show that the proposed algorithm has the highest efficiency and the best selection performance, and that it can usually locate the optimal combination of key wavelength variables in a dataset. The study also balances computation cost against the prediction ability of the model. The evaluation results after PLS modeling are also the best, and the method can optimize multiple objective functions after modeling. |
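
The two-stage elimination described in the abstract (forced removal by an exponentially decreasing function, then competitive removal by weighted bootstrap sampling) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the EDF form r_i = a * exp(-k * i) is borrowed from the CARS algorithm, the names `edf_ratio` and `run_shrinkage` are hypothetical, and the variable weights are held fixed here, whereas the described methods would recompute SR or sMC scores from a PLS sub-model at every iteration.

```python
import numpy as np


def edf_ratio(i, n_iter, n_vars):
    """Fraction of variables kept at iteration i (1-based).

    Assumed CARS-style form r_i = a * exp(-k * i), calibrated so that
    r_1 = 1 (keep everything) and r_N = 2 / n_vars (keep two variables).
    """
    k = np.log(n_vars / 2.0) / (n_iter - 1)
    a = (n_vars / 2.0) ** (1.0 / (n_iter - 1))
    return a * np.exp(-k * i)


def run_shrinkage(weights, n_iter, seed=0):
    """Shrink the variable set over n_iter iterations.

    weights: strictly positive importance scores (stand-ins for SR or sMC
             values; fixed here only to keep the sketch self-contained).
    Returns a list holding the surviving variable indices per iteration.
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    n_vars = weights.size
    survivors = np.arange(n_vars)
    history = []
    for i in range(1, n_iter + 1):
        # Number of variables the EDF allows at this iteration.
        n_keep = max(2, int(round(edf_ratio(i, n_iter, n_vars) * n_vars)))
        n_keep = min(n_keep, survivors.size)
        # Forced elimination (EDF): keep the n_keep highest-weight survivors.
        order = np.argsort(weights[survivors])[::-1][:n_keep]
        kept = survivors[order]
        # Competitive elimination (WBS): draw n_keep times with replacement,
        # with probability proportional to weight, and keep the unique draws.
        prob = weights[kept] / weights[kept].sum()
        drawn = rng.choice(kept, size=n_keep, replace=True, p=prob)
        survivors = np.unique(drawn)
        history.append(survivors)
    return history
```

In a full pipeline, each subset in the returned list would be scored by RMSECV with a PLS model, and the subset with the lowest RMSECV would be taken as the selected variable set.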