| In quantitative near-infrared spectroscopy, prediction accuracy depends on selecting characteristic wavelengths and building a regression model with strong predictive ability. Spectral data are high-dimensional and carry complex internal feature information: absorption peaks are weak and overlap severely, some spectral regions correlate only weakly with the sample components being measured, some wavelength points are uninformative variables, and much of the information is redundant. In addition, the number of wavelengths often exceeds the number of samples. To address these issues, this study proposes an improved wavelength selection algorithm that selects effective feature wavelengths, eliminates irrelevant variables, and reduces the dimensionality of the spectral data, so that the resulting prediction model is more accurate and more widely applicable. The main research contents are as follows. (1) RF-iPLS, a near-infrared wavelength selection method based on random forest (RF) feature importance and interval partial least squares (iPLS), is proposed. The algorithm computes the mean decrease in accuracy (MDA) on out-of-bag (OOB) data as the feature importance score. By applying a feature importance threshold, variables are selected to form the feature wavelength subset. Because RF is randomized, the validity of the variables in this subset cannot be guaranteed. iPLS is therefore used to partition the feature wavelength subset into intervals, compensating for the invalid variables introduced by the randomness of RF. To verify the effectiveness of the RF-iPLS algorithm, PLSR models were built on the selected wavelengths and compared with full-spectrum PLSR models
and with PLSR models based on other wavelength selection methods (GA, SPA, RF, iPLS). On the grain protein dataset, RF-iPLS selected 12 characteristic wavelengths, giving an RMSEP of 0.69 and an R_p^2 of 0.997, better than the four comparison algorithms. On the corn dataset, RF-iPLS selected 43 characteristic wavelengths, giving an R_p^2 of 0.977. RF, the best of the four comparison algorithms on this dataset, gave an R_p^2 of 0.950, so RF-iPLS is an improvement of about 2.84%. These results show that RF-iPLS selects effective feature wavelengths on both datasets and improves prediction performance. After 500 rounds of Monte Carlo cross-validation, the results for the corn and grain protein feature subsets were close to those obtained before cross-validation, which supports the soundness of the RF-iPLS algorithm. The effectiveness of RF-iPLS was further checked by analyzing how the characteristic wavelengths are distributed across the spectrum. Overall, RF-iPLS is an effective feature wavelength selection method that simplifies near-infrared quantitative analysis models and achieves efficient dimensionality reduction. (2) Because the relationship between spectral data and the target physicochemical values is nonlinear, a convolutional neural network (CNN) is used to extract spectral features, with the aim of processing the spectra better and improving the predictive ability of the model. The feature wavelengths selected by the CNN are visualized and then analyzed. For the grain protein dataset, the CNN selected 11 characteristic wavelength points, mainly distributed in the regions 1190-1124 nm, 1168-1680 nm, and 2160-2424 nm; for the corn dataset, it selected 18 characteristic wavelength points, mainly distributed in 1200-1600 nm and 1900-2300 nm. Compared with the number of wavelengths selected by
RF-iPLS, the CNN selected fewer feature wavelengths and thus reduced data redundancy. However, when the CNN-selected wavelengths were used as input to a PLSR model, prediction performance was worse than with RF-iPLS, indicating that a PLSR model built on CNN-selected wavelengths cannot reach the expected prediction accuracy. Therefore, the sequence features that the CNN produces from the spectral data were combined with a long short-term memory (LSTM) network, and CNN-LSTM prediction models were established separately for grain protein and corn protein. Because a CNN-LSTM model usually needs far more training samples than the spectral datasets contain, the Bootstrap method was used to resample the spectral samples. For the 117 grain protein samples, the CNN-LSTM model reached its lowest RMSEC and RMSEP, 0.60 and 0.59 respectively, when Bootstrap resampling produced 7020 samples; for the 80 corn samples, the model predicted best with 7200 resampled samples, giving RMSEC and RMSEP of 0.06 and 0.07 respectively. To examine the structure and fitting behavior of the CNN-LSTM model, a dropout layer was embedded in the corn protein CNN-LSTM model for comparison, using the corn near-infrared spectral dataset as an example. The results indicate that the CNN-LSTM model without the dropout layer predicts better than the version with it. The CNN-LSTM prediction model reduces model complexity and is highly robust. |
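
The Bootstrap resampling step and the error figures quoted above can be illustrated with a small sketch. It is not the study's code: the function names, the toy data, and the definition of R_p^2 as 1 - SS_res/SS_tot are assumptions made here for illustration. Only the sample counts (117 and 7020) and the two R_p^2 values (0.977 and 0.950) come from the text.

```python
import math
import random


def bootstrap_resample(spectra, targets, n_draws, seed=0):
    """Draw n_draws (spectrum, target) pairs with replacement."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(spectra)) for _ in range(n_draws)]
    return [spectra[i] for i in idx], [targets[i] for i in idx]


def rmse(y_true, y_pred):
    """Root mean squared error (RMSEC on calibration data, RMSEP on prediction data)."""
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq_err / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Toy stand-in for 117 spectra with 5 wavelengths each (synthetic, not real data).
toy_rng = random.Random(42)
spectra = [[toy_rng.random() for _ in range(5)] for _ in range(117)]
targets = [10.0 + 5.0 * s[0] for s in spectra]

# Enlarge the training set to 7020 samples, as in the grain protein case.
boot_X, boot_y = bootstrap_resample(spectra, targets, n_draws=7020)

# Improvement of R_p^2 = 0.977 over R_p^2 = 0.950: about 2.84 percent.
gain = relative_gain(0.977, 0.950)
```

Resampling with replacement only repeats existing spectra, so it enlarges the training set without adding new information; any held-out prediction set should be split off before resampling so that duplicates do not leak into it.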