| Near-infrared spectroscopy (NIRS) is a widely used non-destructive detection technique in agriculture, cultivation, medicine, environmental science, ecology, and other fields. It is simple, fast, and cost-effective. NIRS is most commonly combined with machine learning for fruit quality detection and discrimination, and many researchers are working on building high-precision fruit quality prediction models. As the feature dimension and sample size of the data grow, large high-dimensional sample sets come to contain inaccurate, redundant, missing, and noisy samples, which degrades model performance and the decisions based on it. To optimize the sample set and improve model performance, instance selection is used to perform a secondary sample selection on the training set. However, many existing algorithms are not suited to instance selection on NIR data, so the problem merits study. In this thesis, a novel instance selection method suitable for spectral data is proposed. The main research contents and conclusions are as follows: 1. An instance selection method based on least angle regression (LARIS) is proposed. LAR focuses on the correlation between each sample and the standard sample, and the geometric angle distance is used to reflect the correlation between the sample spectrum and the standard sample spectrum. The standard sample is approximated by a linear combination of the samples, and samples with a coefficient of 0 are eliminated. The training samples are ranked by correlation value and introduced one by one into a cross-validated model, and the subset of training samples with the smallest prediction error or the highest accuracy is retained as the final optimal training set. 2. Apples were used as experimental objects, and a near-infrared spectroscopy apple Brix prediction experiment was designed to explore the performance on regression of instance selection algorithms such as 
LARIS, Kennard-Stone (KS), Sample Set Partitioning based on Joint X-y Distance (SPXY), Condensed Nearest Neighbor for Regression (Reg CNN), and Edited Nearest Neighbor for Regression (Reg ENN). Partial least-squares regression (PLSR) models were established for the regression task. The experiment was conducted on the laboratory's intelligent online fruit quality inspection equipment: the diffuse transmission spectra of the samples were collected, and the Brix of the apples was measured. Compared with the raw training set, the results show that Reg CNN, Reg ENN, KS, and SPXY behave at one of two extremes: either a high compression rate with a loss of model performance, or insignificant compression with model performance maintained. These methods are therefore not suitable for instance selection on NIR spectra. A comprehensive analysis of the absorbance peaks of the sample spectra, outlier samples, sample set distribution, and fitting bias shows that, for the optimal training set selected by LARIS, the spectral characteristics are more pronounced, the sample set is more uniformly distributed, the model fit at each wavelength point is more stable, and the gap between RMSECV and RMSEP is reduced. The prediction error with the LARIS optimal set is reduced by 6% while using only 48.5% of the raw training set. LARIS is thus shown to be better suited to instance selection for simple regression on NIR spectral data, optimizing the training set while improving the model's predictive ability. This provides a theoretical basis and experimental support for applying LAR to instance selection for classification. 3. Apples were used as experimental objects, and a near-infrared spectroscopy apple origin discrimination experiment was designed to explore the performance of the LARIS, KS, SPXY, CNN, ENN, and Segmented KNN instance selection algorithms on classification. A Support Vector Machine (SVM) classification model was built. Diffuse reflectance spectra of apples 
from four origins were collected with a laboratory portable near-infrared spectrometer. When LARIS is applied to multi-class samples, the optimal samples of each class are first selected and their union is taken; the subset with the lowest error or highest accuracy is then determined as the final optimal training set for prediction. Comparison in terms of clustering distribution, imbalance, compression rate, accuracy, and stability reveals that the optimal training sets selected by CNN, ENN, Segmented KNN, KS, and SPXY differ more from the raw training set and the test set. LARIS reduces the raw training set to 73.1% of its samples and improves the classification accuracy by about 5%. The distribution of the optimal set is closer to that of the test set, and the imbalance rate is unchanged. Therefore, the LARIS method optimizes the raw training set while improving the model's predictive ability. This provides theoretical and experimental support for applying the least angle regression algorithm to instance selection and is of some significance for handling large-scale data sets. |
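
The ranking-and-selection idea described in item 1 can be illustrated with a minimal sketch. It is not the thesis's implementation and does not run least angle regression: it swaps in a plain cosine-angle ranking against a standard spectrum, then grows the subset one sample at a time and keeps the prefix with the lowest error. The function names, the toy spectra, and the stand-in `toy_error` function are all hypothetical; in the setting described above, the error would come from a cross-validated PLSR or SVM model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def angle_to_standard(spectrum: Sequence[float], standard: Sequence[float]) -> float:
    """Angle in radians between a sample spectrum and the standard spectrum.

    Assumes both vectors are non-zero and of equal length.
    """
    dot = sum(a * b for a, b in zip(spectrum, standard))
    norm = math.sqrt(sum(a * a for a in spectrum)) * math.sqrt(sum(b * b for b in standard))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos)


def rank_by_angle(spectra: Sequence[Sequence[float]], standard: Sequence[float]) -> List[int]:
    """Sample indices ordered from smallest angle (most correlated) to largest."""
    return sorted(range(len(spectra)), key=lambda i: angle_to_standard(spectra[i], standard))


def select_subset(ranked: List[int],
                  error_of: Callable[[List[int]], float]) -> Tuple[List[int], float]:
    """Add ranked samples one by one; keep the prefix with the lowest error."""
    best_subset: List[int] = []
    best_error = math.inf
    for k in range(1, len(ranked) + 1):
        candidate = ranked[:k]
        err = error_of(candidate)
        if err < best_error:
            best_subset, best_error = list(candidate), err
    return best_subset, best_error


# Toy data (hypothetical): four 3-point "spectra"; sample 2 is an outlier.
spectra = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.1], [3.0, 1.0, 0.0], [1.0, 2.1, 2.9]]
standard = [1.0, 2.0, 3.0]  # assumed standard spectrum


def toy_error(subset: List[int]) -> float:
    """Stand-in for a cross-validated error: more samples help, the outlier hurts."""
    return 1.0 / len(subset) + (5.0 if 2 in subset else 0.0)


ranked = rank_by_angle(spectra, standard)
best_subset, best_error = select_subset(ranked, toy_error)
print(ranked, best_subset, round(best_error, 3))
```

In this toy run the outlier is ranked last and is left out of the selected subset, since adding it raises the stand-in error.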