| Raw protein mass spectrometry data generally consist of many individual data pairs, and current analyses of these individual pairs cannot accurately characterize the raw data, extract quantitative features, or support more in-depth applications. In mass spectrometry data identification in particular, the data are single-sample, small in volume, irregular, and inconsistent in length, which makes them difficult for commonly used machine learning methods to handle. Commercial software is expensive, its databases are not updated in a timely manner, and its functionality is limited. The similarity calculations commonly applied afterwards also fail to discriminate accurately because mass spectrometry data differ so widely. To solve these problems, this article proposes two mass spectrometry data preprocessing methods. Based on these two methods, two corresponding models are established and evaluated, one for classification prediction and one for similarity-based identification of the data. The aim is to analyze raw protein mass spectrometry data more accurately and to provide more reliable theoretical support for related fields of research. (1) Two mass spectrometry data preprocessing methods were designed to address the single-sample, small-volume, irregular, and variable-length nature of raw mass spectrometry data. The first method augments the original mass spectrometry data with a complex resampling method called Flex-Bootstrap and then builds a standard mass spectrometry library from the reference mass spectrometry data and the augmented pseudosamples. The second method first establishes a standard format for the unknown mass spectrometry data without augmentation and then fills the reference mass spectrometry data into the standard format of the unknown spectrum. (2) To address the classification difficulties and limited accuracy of commonly used machine learning
methods, data augmentation is applied first, and deep learning is then used in mass spectrometry data processing to build a multi-class recognition network for identifying protein mass spectrometry data. A mass spectrometry data analysis method based on Flex-Bootstrap and a neural network fusion model is proposed to solve the problems in mass spectrometry data retrieval, with protein mass spectrometry data used as an example for validation. The results show that the proposed Flex-Bootstrap-based method achieves an accuracy of 98.82% and a loss function value of 0.0397 with a fusion model of multiple convolutional neural networks (Multi-CNN) and deep neural networks (DNN). This approach not only effectively resolves the underfitting observed in data retrieval with the DNN model and with the CNN and DNN fusion model, but also improves the classification accuracy and search efficiency of mass spectrometry databases, verifying the feasibility of the approach. (3) The identification of protein mass spectra was studied using an entropy-based similarity scoring algorithm, which addresses the common problems of tedious and inaccurate similarity calculations in mass spectrometry data processing. Without augmenting the data, this method evaluates the similarity between two mass spectra by computing their information entropy and using it as a coefficient in the mass spectral entropy scoring algorithm. Compared with commonly used similarity calculations, this method improves identification accuracy, reduces computational complexity, and produces consistent results. These findings demonstrate the feasibility of the proposed approach. Additionally, applying a power transformation to the original intensity values greater than 1 could significantly improve the algorithm’s coefficient of variation, which helps enhance the search efficiency of mass spectral databases. This article enriches the research on protein species recognition, which not only allows for the
comparison and matching of a large number of mass spectra in a short time, but also reduces the added cost of using specialized software. Moreover, it could help the life sciences and medical industries improve the search efficiency of mass spectrometry databases, quickly identify and authenticate protein species, and provide strong support for research and applications in related fields, making the approach better suited to practical application scenarios. |
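
The entropy-based similarity scoring described in point (3) can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the code below assumes the widely used spectral entropy similarity, in which the score is one minus the normalized entropy increase of a 1:1 mixture of the two spectra. The function names, the fixed-width m/z binning used to match peaks, the `bin_width` default, and the `power` parameter are all hypothetical choices made for illustration and are not taken from the article.

```python
import math


def binned(peaks, bin_width=0.01, power=1.0):
    """Turn (mz, intensity) pairs into a normalized distribution over m/z bins.

    Assumption: peaks are matched by fixed-width binning, and intensities are
    optionally raised to `power` before normalization.
    """
    dist = {}
    for mz, intensity in peaks:
        if intensity <= 0:
            continue  # ignore empty or negative peaks
        key = round(mz / bin_width)
        dist[key] = dist.get(key, 0.0) + intensity ** power
    total = sum(dist.values())
    if total <= 0:
        raise ValueError("spectrum has no positive intensity")
    return {k: v / total for k, v in dist.items()}


def entropy(dist):
    """Shannon entropy (natural log) of a normalized intensity distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def entropy_similarity(a, b, bin_width=0.01, power=1.0):
    """Return a score in [0, 1]: 1 for identical spectra, 0 for disjoint ones."""
    da = binned(a, bin_width, power)
    db = binned(b, bin_width, power)
    # 1:1 mixture of the two normalized spectra
    mixed = {k: 0.5 * da.get(k, 0.0) + 0.5 * db.get(k, 0.0)
             for k in set(da) | set(db)}
    # Entropy gained by mixing; it is at most ln(2)
    gain = entropy(mixed) - 0.5 * entropy(da) - 0.5 * entropy(db)
    return 1.0 - gain / math.log(2)


if __name__ == "__main__":
    unknown = [(100.0, 10.0), (200.0, 30.0)]
    reference = [(100.0, 30.0), (200.0, 10.0)]
    print(entropy_similarity(unknown, reference))
```

In this sketch the score depends only on the two spectra being compared, so no data augmentation is needed, in line with the description above. The `power` argument is one possible way to experiment with the intensity power transformation the abstract mentions; the exponent the authors used is not stated here.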