| Background: During emergency inspections, drug testing agencies often encounter samples with missing labels and unknown ingredients. Accurately classifying and identifying such samples requires a range of analytical instruments and methods. However, most of these instruments and methods cause irreversible damage to the samples, compromising sample integrity and significantly reducing the efficiency of emergency inspections. In recent years, Raman and near-infrared (NIR) spectroscopy have advanced rapidly and are now widely used as non-destructive analytical methods. Although these techniques provide high-quality data, the high dimensionality and complexity of the data make traditional spectral analysis time-consuming and difficult. To improve the efficiency and quality of inspections, machine learning algorithms can be used to build models tailored to the analysis objective, enabling rapid and accurate classification and identification of samples. This study therefore combines Raman and NIR spectroscopy with machine learning to address multi-class classification and ingredient analysis in drug pattern recognition. Objective: To address drug classification and component analysis in emergency testing scenarios, this study aims to construct a multi-class classification model for drugs of multiple varieties and manufacturers, based on NIR spectral data and machine learning methods. In addition, it uses data augmentation and machine learning to develop an analytical model for identifying unknown components in drugs, drawing on both Raman and NIR spectra of various active ingredients and excipients. Methods: For drug classification, this study selected samples of four
manufacturers’ compound α-ketone tablets, five manufacturers’ rigevidon tablets, and seven manufacturers’ erythromycin enteric-coated tablets. The near-infrared spectra of these drugs were collected with a near-infrared spectrometer equipped with a fiber-optic probe. In the data preprocessing stage, seven spectral preprocessing methods were evaluated for their impact on NIR analysis: vector normalization, Savitzky-Golay (SG) smoothing, multiplicative scatter correction, first-order derivative, z-score normalization, wavelet smoothing, and wavelet denoising. For classification, three variants of recurrent neural networks (RNNs) were used: long short-term memory (LSTM) networks, bidirectional long short-term memory (BiLSTM) networks, and gated recurrent unit (GRU) networks, which are well suited to classifying one-dimensional data. The LSTM, BiLSTM, and GRU models were built with different numbers of layers, with dropout applied after the corresponding layers to prevent overfitting. A fully connected layer preceded the output layer, and a softmax function produced the class predictions. The classification metrics of the LSTM, BiLSTM, and GRU models were compared across the different numbers of layers. For component analysis, near-infrared and Raman spectra of 368 compounds were first collected, and the multi-label classification problem was solved by building multiple binary classification models. Second, three input schemes were used to train the models and determine which spectral type was most suitable for model construction: near-infrared spectra only, Raman spectra only, or the two concatenated. Then, to ensure that each binary classification model was sufficiently trained, mixture spectra were randomly generated at varying proportions, including positive spectra containing the target analyte and negative spectra without it. Three different neural networks were
compared during model construction, and the best-performing model was selected. Finally, spectra of real mixed powders were used as the test set, and model performance was evaluated using accuracy, precision, recall, and F1 score. Results: For variety classification, z-score normalization performed best among all preprocessing methods in both the nine-class and sixteen-class tasks. Among the three model variants, the GRU series performed best. In the nine-class task, the GRU-3 model achieved an accuracy of 99.65±0.70 and reached 1 on all three comprehensive metrics (macro F1 score, Matthews correlation coefficient, and kappa coefficient), giving the best classification performance. In the sixteen-class task, the GRU-2 model performed best, with an accuracy of 98.68±0.42 and all comprehensive metrics reaching 0.99. The BiLSTM series showed strong classification performance on spectra with severe overlap, such as those processed with the first-order derivative. For component analysis, when the data augmentation method from the literature was used, the ResUCA (residual network-based unknown component analysis) model constructed in this study outperformed the DeepCID (deep learning-based component identification) model on all four metrics. After the literature data augmentation method was optimized, three of the four metrics improved over the pre-optimization model, the exception being precision. When samples before and after grinding were used for modeling, the model built from ground samples performed better. Of the three spectral inputs compared for model training, Raman spectra gave the best performance. In the extended experiment, the model's recall remained high. To address the accompanying increase in false positives, the model parameters can be further optimized for the specific scenario. Conclusion: For the component analysis
problem, this study optimized data augmentation, spectral selection, and sample processing to construct ResUCA, a residual-structure model for analyzing unknown drug components, which has the potential to be applied in other fields such as food, cosmetics, and coatings. For the variety classification problem, a range of spectral preprocessing methods and recurrent neural network models were compared, and z-score normalization combined with the GRU series was found to be suitable for classifying unknown varieties. If the original spectra overlap severely, the BiLSTM series can be used instead. These models could also be applied to NIR data in scenarios such as tracing the production areas of traditional Chinese medicines and evaluating and grading minerals. The fast multi-class drug classification and component analysis models established here through spectral analysis can greatly reduce the volume of testing required by conventional methods. This will support the development of emergency testing and drug regulatory technology, speed up emergency testing response, and strengthen drug regulation. As a demonstration of applying machine learning to drug spectral analysis, this study also provides technical support for combining data from other analytical instruments with machine learning, improving data utilization in the era of big data. |
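
The mixture-spectrum augmentation described in the Methods can be illustrated with a minimal sketch. It is not the study's implementation: it assumes that a mixture spectrum can be approximated as a weighted sum of pure-compound spectra, and the names `zscore` and `make_mixtures`, the `max_components` parameter, and the alternating positive/negative scheme are hypothetical choices made for illustration. Each simulated spectrum is z-score normalized before being returned.

```python
import numpy as np

def zscore(spectra):
    """Standardize each spectrum (row) to zero mean and unit variance."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / np.where(std == 0, 1.0, std)

def make_mixtures(library, target, n_samples, max_components=4, rng=None):
    """Simulate labelled mixture spectra for one binary (target present/absent) model.

    library: (n_compounds, n_points) array of pure-compound spectra.
    target:  row index of the compound this binary model should detect.
    Returns (X, y): z-scored mixture spectra and 0/1 labels.
    ASSUMPTION: mixtures are linear combinations of the pure spectra.
    """
    rng = np.random.default_rng(rng)
    library = np.asarray(library, dtype=float)
    others = np.delete(np.arange(len(library)), target)
    max_k = min(max_components, len(others))
    X = np.empty((n_samples, library.shape[1]))
    y = np.empty(n_samples, dtype=int)
    for i in range(n_samples):
        positive = i % 2 == 0                 # alternate to keep classes balanced
        k = int(rng.integers(1, max_k + 1))   # number of components in this mixture
        n_other = k - 1 if positive else k
        picks = rng.choice(others, size=n_other, replace=False).tolist() if n_other else []
        if positive:
            picks.append(target)              # positive spectra contain the target
        weights = rng.random(len(picks)) + 1e-3
        weights /= weights.sum()              # random proportions summing to 1
        X[i] = weights @ library[picks]
        y[i] = int(positive)
    return zscore(X), y

# Usage with a synthetic stand-in for a spectral library (6 compounds, 50 points)
library = np.random.default_rng(0).random((6, 50))
X, y = make_mixtures(library, target=2, n_samples=20, rng=1)
```

Calling `make_mixtures` once per compound yields one training set per binary model, which matches the one-model-per-component framing of the multi-label problem; real spectra would additionally need noise and baseline effects that this sketch omits.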