Background: Primary central nervous system lymphoma (PCNSL) is a lymphoma that arises in the central nervous system. Its annual incidence is 0.5 cases per 100,000 people; incidence in the elderly is four times that in younger adults and has been increasing over the years. Two diagnostic methods are currently available. One is magnetic resonance imaging (MRI), which is safe and effective but has difficulty differentiating PCNSL from other neurological diseases, leading to misdiagnosis. The other is brain biopsy, which provides accurate results but requires invasive surgery, greatly increasing the risk of brain hemorrhage or even death, especially in elderly patients. There is therefore an urgent need for accurate, non-invasive diagnostic methods to reduce the risk of death and the disease burden of PCNSL. Growing cancer cells release extracellular vesicles into body fluids, and these vesicles are involved in cancer cell metastasis and other activities. Compared with other cellular components, extracellular vesicles offer good stability, high expression levels, strong specificity, and non-invasive extraction. Extracellular vesicles in body fluids and their contents, such as ncRNAs and proteins, are therefore promising biomarkers for disease discrimination. Single-exosome sequencing measures the expression levels of surface proteins on individual exosomes, providing data for differential protein marker analysis at both the single-exosome level and the bulk level in tissues. Differential expression analysis based on different types of sequencing data is an important and challenging task for exploring the mechanisms of disease occurrence and identifying disease biomarkers. Single-exosome sequencing data consist of matrices of exosome protein expression levels from different samples. Differential analysis can be classified into bulk analysis at the sample level and single-exosome analysis at the individual
exosome level, depending on the scale. The former compares the total protein expression levels of exosomes between groups to screen for differentially expressed proteins from an overall perspective, while the latter screens for differentially expressed proteins based on the expression levels of individual exosomes in tissues, taking into account exosome heterogeneity. The two approaches validate and complement each other, and using them rationally and effectively can accurately identify genes or proteins relevant to the occurrence of cancer. Building disease discrimination models on the protein markers screened in PCNSL will provide an important methodological foundation for the early diagnosis and identification of cancer. During disease development, proteins interact with each other, forming co-expression networks that carry out complex regulatory and signaling functions. Protein co-expression information is therefore crucial for accurate disease discrimination from extracellular vesicle protein expression data. Graph neural networks apply neural networks to graphs to solve complex network problems and have demonstrated superior performance in tasks such as node classification and link prediction. Their principles of neighborhood aggregation and information propagation align well with the way information is transmitted through protein functional networks. Disease discrimination models built on graph neural networks from protein expression data are therefore expected to be more accurate and to provide better methodological tools for clinical research.

Objectives: Tear samples were collected from subjects in the case and control groups, and the expression levels of surface proteins on extracellular vesicles in tears were measured. The purpose of this paper is to use these sequencing data to: (1) Screen for protein biomarkers of primary central nervous
system lymphoma, to provide a reference for the study of disease pathogenesis and targeted therapeutic drugs. (2) Build a discrimination model for the onset of primary central nervous system lymphoma, to assist clinicians in accurately identifying the disease and to provide technical support for prevention, early diagnosis, and early treatment in high-risk populations.

Methods: For the study of exosome surface protein markers, 70 subjects were included: 32 patients with primary central nervous system lymphoma and 38 controls. Tears from the lacrimal glands of these 70 subjects were subjected to single-exosome sequencing, yielding 542,428 exosomes in total, with expression levels of 198 proteins measured in each exosome. With exosomes as the sample unit and proteins as variables, a 542,428-row, 198-column matrix of single-exosome protein expression levels was constructed for the case and control groups. Two strategies were applied to this matrix to explore differential proteins. The first strategy analyzed total protein expression levels. The matrix was collapsed into a 70-row, 198-column matrix of total protein expression levels per individual by summing the protein expression levels within each sample. The TMM algorithm was used for normalization, followed by univariate analysis to eliminate irrelevant protein features and narrow the selection range. The Lasso algorithm was then applied to the total expression data to further select protein variables and obtain the associated proteins. The second strategy first selected highly variable proteins from the single-exosome expression matrix using the VST algorithm, and then treated individuals as random effects. A Lasso-penalized generalized linear mixed-effects model (glmmLasso) was used to select protein variables from the single-exosome data. Further validation of the proteins selected by the above algorithms was conducted
using external data from the TCGA and GEO databases, which provide RNA expression data for PCNSL. After normalization and batch effect removal, the external case and control validation datasets were constructed. ROC curves were plotted and AUC values calculated from the gene expression levels of the selected proteins to assess the reliability of the biomarkers on external data. For the prioritized proteins, immunohistochemical staining was performed to confirm experimentally the expression differences between PCNSL tumor tissue and normal tissue. Finally, functional analysis of the selected proteins was conducted to further justify the selected protein markers from a biological function perspective. For the disease discrimination models, the 70-sample matrix of total protein expression levels from the first strategy was used. Logistic regression and graph attention networks were used to build prediction models. For logistic regression, features were selected from the significant univariate results using stepwise regression and Lasso regression, and the selected features were entered as independent variables in the model. For the graph attention network, the proteins selected by the two strategies were used to construct a protein co-expression network with the WGCNA method; the network adjacency matrix served as the graph input, and the corresponding expression levels served as the input feature matrix. As a control, all unfiltered protein features were also fed into a graph attention network for disease discrimination, to validate the effectiveness of the proteins selected in this study. Internal validation was performed using ten-fold cross-validation, and model performance was evaluated using
sensitivity, specificity, accuracy, F1 score, ROC curve (AUC), calibration curve, and decision curve metrics.

Results: (1) Univariate analysis of the protein expression data identified 16 of the 198 proteins as differentially expressed with statistical significance: CD9, ADAM10, PCDH17, ITGB1, CDH1, ITGA2, TACSTD2, NLGN1, CD151, ITGB7, TENM4, CD44, CDH5, CDH12, KIT, and LAMP2. CD9 showed the most significant difference and CD44 had the highest odds ratio (OR). Lasso regression then selected 10 proteins (CD9, CD44, CDH1, CDH5, ITGB7, PCDH17, KIT, LAMP2, NLGN1, and TENM4), which were entered into multivariate regression analysis; CD9 and CD44 showed significant differences. (2) The VST method was used for an initial screen of highly variable proteins in the extracellular vesicle data, identifying 87 proteins. The glmmLasso method then identified 54 proteins that differed significantly between cases and controls. Compared with the total protein analysis, 7 of the proteins already identified by univariate analysis (ADAM10, CD9, CD44, PCDH17, CD151, CDH5, and LAMP2) remained statistically significant in the glmmLasso selection. Of the proteins selected by Lasso regression, CD9, CD44, PCDH17, CDH5, and LAMP2 were also significant in the glmmLasso results. This method additionally identified IL-10, CXCL13, and other protein markers that have been experimentally verified to be associated with PCNSL. (3) A new dataset was created by combining RNA sequencing data from 16 patients with PCNSL and 7 control lymph node samples, all from Asian populations, obtained from the GEO and TCGA databases. After normalization and batch effect removal, a sample × RNA expression matrix was obtained, and ROC curves and AUC values were calculated for the corresponding genes. The AUC values for CD9, CD44, PCDH17, and
ITGB7 were 0.768, 0.679, 0.813, and 0.938, respectively. Immunohistochemical staining of CD9 then showed prominent staining, confirming its high expression in PCNSL tumor tissue. Functional analysis of the differentially expressed proteins from the extracellular vesicle data was performed using the GO and KEGG databases. The enriched pathways were predominantly associated with cell adhesion and regulation, regulation of leukocyte migration, positive regulation of leukocyte-mediated immunity, and maintenance of the blood-brain barrier. Existing literature has confirmed that these pathways are closely related to the occurrence and development of PCNSL. (4) Among the tumor classification models, the extracellular vesicle protein graph attention network achieved an accuracy of 0.814, precision of 0.825, recall of 0.798, and F1 score of 0.811. Its AUC was 0.919 on the training set and 0.869 on the testing set, higher than the other models. Its calibration curve had the best fit, and its decision curve showed a higher net benefit than the other models for threshold probabilities between 0.2 and 0.7. Logistic regression with Lasso feature selection had an accuracy of 0.786, precision of 0.773, recall of 0.752, and F1 score of 0.757, with an AUC of 0.888 on the training set and 0.826 on the testing set. Logistic regression with stepwise feature selection had an accuracy of 0.786, precision of 0.773, recall of 0.752, and F1 score of 0.757, with an AUC of 0.892 on the training set and 0.828 on the testing set.

Conclusion: (1) Through statistical analysis of single extracellular vesicle sequencing data, this study identified the differential protein CD9, which plays an important role in the pathogenesis of PCNSL. RNA-seq data for PCNSL from the TCGA and GEO databases were used for external validation of the differential protein, and experimental verification was conducted
through immunohistochemical staining. Functional analysis using the GO and KEGG databases revealed that CD9 is involved in lymphocyte differentiation, regulation of the immune system, and immune response, and also plays a role in cell proliferation, migration, and adhesion, all of which are closely related to the occurrence and development of PCNSL. This provides a basis for further investigating the pathogenesis of PCNSL and developing new therapeutic targets. (2) This study established a protein co-expression network from extracellular vesicle protein expression data and used it to construct a graph attention network for PCNSL discrimination. The model achieved the best AUC on both the training and testing sets, with excellent calibration curve agreement and a higher net benefit on the clinical decision curve than the other models. These results indicate that the model has good discriminatory performance and provides a new auxiliary tool for the diagnosis and discrimination of PCNSL.
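
As an illustration of the first strategy described in Methods (collapsing the exosome-by-protein matrix into a subject-by-protein matrix of totals), the following is a minimal sketch on toy data, not the study's code. The function names `aggregate_by_subject` and `counts_per_million` and the toy values are hypothetical. The scaling step is plain library-size (counts-per-million) scaling, used here only as a simple stand-in; it is not the TMM algorithm the study reports.

```python
import numpy as np


def aggregate_by_subject(expr, subject_ids):
    """Sum single-exosome protein levels into one row per subject.

    expr:        (n_exosomes, n_proteins) array of expression levels.
    subject_ids: (n_exosomes,) array giving the subject each exosome came from.
    Returns (subjects, totals), where totals has shape (n_subjects, n_proteins)
    and rows follow the sorted order of the unique subject labels.
    """
    subjects, inverse = np.unique(subject_ids, return_inverse=True)
    totals = np.zeros((len(subjects), expr.shape[1]), dtype=float)
    # Unbuffered scatter-add: each exosome row is added to its subject's row.
    np.add.at(totals, inverse, expr)
    return subjects, totals


def counts_per_million(totals):
    """Library-size scaling (a simple stand-in, NOT TMM normalization)."""
    lib_size = totals.sum(axis=1, keepdims=True)
    safe = np.where(lib_size == 0, 1.0, lib_size)  # avoid division by zero
    return totals / safe * 1e6


# Toy data: 5 exosomes, 3 proteins, 2 subjects (all values hypothetical).
expr = np.array([
    [1, 0, 2],
    [3, 1, 0],
    [0, 4, 1],
    [2, 2, 2],
    [5, 0, 1],
])
subject_ids = np.array(["a", "a", "b", "b", "b"])

subjects, totals = aggregate_by_subject(expr, subject_ids)
cpm = counts_per_million(totals)
print(subjects)
print(totals)
```

With the study's dimensions, `expr` would have 542,428 rows and 198 columns and `totals` would have 70 rows; the per-subject matrix would then go on to normalization and feature selection.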