| The protein content of wheat grain is the main index for evaluating wheat quality. Rapid detection of grain protein content can support sound, scientifically based decisions in the purchase and grading of wheat grain. Near-infrared spectroscopy (NIRS) offers fast and efficient analysis, a wide application range, no sample pretreatment, no pollution during measurement, low analysis cost, no damage to samples, and easy implementation of online analysis. It has been widely used for the rapid detection of product components. When NIRS is used to quickly obtain rich information about target attributes, a large amount of high-dimensional data is often generated, which makes the data difficult to analyze and interpret. Chemometrics is a commonly used tool for spectral analysis. It is data-driven and often gives different results for different types of data. To reduce the influence of data type as much as possible, it is necessary to study how to optimize and construct NIRS-based prediction models for specific target attributes. This paper takes wheat grain protein as the research object and mainly discusses several aspects of the spectral modeling process (preprocessing method, variable selection, model selection, etc.). First, it analyzes the influence of different spectral preprocessing steps (spectral interval selection, data set division, spectral transformation, the order of spectral transformation and data set division, etc.) on the spectral model. Then, based on model population analysis, two different variable selection algorithms are established and verified on experimental data. On this basis, the influence of combining preprocessing and variable selection on the model is studied. Finally, the performance of the spectral model based on different modeling methods is
comprehensively compared. The research conclusions are as follows. (1) A systematic comparative study of preprocessing methods for wheat grain spectral data was carried out from several aspects. The results show that the data set division strategy has a large impact on the partial least squares regression (PLSR) model. Compared with the Kennard-Stone (KS) algorithm, the sample set partitioning based on joint x-y distances (SPXY) and kernel sample set partitioning based on joint x-y distances (KSPXY) algorithms have a clear advantage in model prediction accuracy and model stability. Although the SPXY algorithm may be slightly inferior to the KSPXY algorithm in model performance, it has a clear advantage in computational cost. The models built on dataset B with the KS, SPXY and KSPXY algorithms perform better, with root mean squared errors of prediction (RMSEP) of 0.239, 0.129 and 0.131, respectively. In addition, the choice of spectral interval also has a large impact on model prediction performance, especially when there are few samples. For the same samples, the model built on the 730-1100 nm waveband has the best accuracy and stability, indicating that preliminary screening of spectral wavebands is also very important in spectral modeling. Regarding the influence of the spectral transformation method on model performance, performing the spectral transformation before the data set division helps to improve model accuracy and stability. Compared with other spectral transformation methods, fractional derivative and wavelet transform processing improve the prediction performance and stability of the model, but the calculation is more complicated and the best preprocessing parameters must be found through tuning. Based on dataset B, the root mean squared error of calibration (RMSEC) and RMSEP of the optimal model built with fractional derivative
preprocessing are 0.104 and 0.097, respectively, and the RMSEC and RMSEP of the optimal model built with wavelet transform preprocessing are 0.109 and 0.089, respectively. On this basis, the model transfer ability of the two methods is compared, and the derivative preprocessing method is found to transfer better than the wavelet transform preprocessing method. (2) Based on model population analysis and weighted bootstrap sampling, this study proposes a new variable selection method called the adaptive variable re-weighting and shrinking approach (AVRSA). On the three NIRS data sets, AVRSA effectively reduces the number of spectral variables (by at least 84%) and reduces the prediction error of the model. In addition, comparison with three variable selection methods, namely competitive adaptive reweighted sampling (CARS), Monte Carlo uninformative variable elimination and iteratively variable subset optimization, shows that AVRSA can quickly find the best subset of informative variables in high-dimensional spectral data, which effectively improves the prediction performance of the model. (3) This study proposes an interval variable selection method based on random frog, called interval selection based on random frog (ISRF). On the three NIRS data sets, comparison with four variable selection methods, namely genetic algorithm PLS, random frog, interval random frog and the interval variable iterative space shrinkage approach, shows that ISRF can effectively find the best interval variables and improve the predictive performance and interpretability of the model. (4) This study takes one of the wheat grain protein spectral datasets as the research object and compares the influence of different combinations of preprocessing methods (mean centering, Savitzky-Golay (SG) smoothing, SG first derivative, SG second derivative, multiplicative scatter correction, standard normal variate, detrending, continuum removal and continuous wavelet
transform) and variable selection methods (CARS, successive projections algorithm and AVRSA) on the PLSR models. It is found that, when different variable selection methods are used, the model based on the optimal spectral transformation method is not necessarily optimal; likewise, the model built on the combination of the optimal spectral transformation and the optimal variable selection algorithm is not necessarily optimal. In addition, the spectral transformation method has a large impact on variable selection, and the distributions of the selected variables differ considerably. Among the variable selection methods, models coupling AVRSA with different spectral transformations are stable, and the distributions of their characteristic variables differ the least. The model built with SG second derivative preprocessing and the AVRSA algorithm has the best performance, with RMSEC and RMSEP of 0.203 and 0.176, respectively, and can be used to construct a wheat grain protein content model. (5) In this part, the SPXY algorithm was used to divide the dataset into a training set and a test set, the samples in the training set were augmented, and a convolutional neural network (CNN) model was then constructed from the augmented training samples. On two wheat grain protein spectral datasets, the CNN model was compared with models built with traditional machine learning algorithms (support vector regression and random forest) and with PLSR. It is found that the CNN model, without any spectral transformation, can match the performance of the PLSR model built with the optimal spectral transformation method. The RMSEs of the CNN model on the training and test sets are 0.192 and 0.161, respectively, for dataset A, and 0.081 and 0.100, respectively, for dataset B. In addition, the prediction accuracy of the CNN model is better than that of the models established by traditional machine learning algorithms. It
shows that the CNN algorithm can easily and effectively predict wheat grain protein content. |
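
As an illustration of the data set division step discussed in the abstract, the following is a minimal sketch of SPXY-style partitioning together with a root mean squared error helper of the kind used for RMSEC and RMSEP. It assumes the commonly described form of SPXY: Euclidean distances in the spectra and in the reference values are each divided by their maximum, summed, and then fed to a Kennard-Stone style max-min selection. The function names `spxy_split` and `rmse` are hypothetical, and this is not the implementation used in the study.

```python
import math


def _pairwise(rows):
    """Euclidean distance matrix between rows (each row is a list of floats)."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(rows[i], rows[j])
    return d


def spxy_split(X, y, n_train):
    """Return (train_idx, test_idx) chosen by joint x-y distance.

    Assumption: d_xy = d_x / max(d_x) + d_y / max(d_y), followed by
    Kennard-Stone selection on d_xy.
    """
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    dx = _pairwise(X)
    dy = _pairwise([[v] for v in y])
    # Guard against division by zero when all samples coincide.
    max_dx = max(max(row) for row in dx) or 1.0
    max_dy = max(max(row) for row in dy) or 1.0
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)]
         for i in range(n)]
    # Start with the two most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: d[p[0]][p[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    # Repeatedly add the sample farthest from its nearest selected sample.
    while len(selected) < n_train:
        nxt = max(sorted(remaining),
                  key=lambda k: min(d[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, sorted(remaining)


def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
    )


# Toy data (made up for illustration): four one-variable "spectra".
X_demo = [[0.0], [1.0], [2.0], [10.0]]
y_demo = [0.0, 1.0, 2.0, 10.0]
train_idx, test_idx = spxy_split(X_demo, y_demo, n_train=3)
```

Evaluating `rmse` on the calibration samples gives an RMSEC-style figure, and on the held-out samples an RMSEP-style figure. Dropping the `dy` term from `spxy_split` reduces the procedure to plain Kennard-Stone selection on the spectra alone.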