| Cells are the basic structural units of organisms and are known as the “building blocks” of life. Exploring the development and differentiation of cells and their impact on organs has long been a focus of research in the life sciences. In recent years, single-cell RNA sequencing technology has made it possible to identify cell types at single-cell resolution, improving biologists' understanding of the phenotypic and compositional heterogeneity of cells in complex tissues. Accurately annotating the cell type of each cell is a critical step in the single-cell data analysis pipeline, but the high dimensionality, dropout events, and batch effects of single-cell RNA sequencing data make accurate annotation challenging. To further improve the accuracy of cell type annotation, this article focuses on several aspects: batch effects between datasets, incorrect cell type annotations in reference datasets, construction of more comprehensive reference datasets from multiple sources, and the development and improvement of cell type annotation models from a model interpretability perspective. The goal of this research is to advance the field and meet the needs of researchers in biological analysis. The main research content of this article is as follows: (1) To extract discriminative features, a cell type annotation method based on deep metric learning is proposed. This method trains a feature extractor to embed cells so that cells of the same type lie closer together in the new embedding space, while cells of different types lie farther apart. Validation on multiple benchmark datasets shows that the feature extractor based on deep metric learning can extract information specific to each cell type. Experimental results show that the extracted
specific information can effectively eliminate batch effects between the reference and query datasets, thereby improving the performance of cell type annotation. (2) A cell type annotation method based on collaborative learning is proposed to improve annotation accuracy when the reference dataset is defective, that is, when it contains annotation errors. The method uses two networks that learn from each other; in each iteration, training samples with incorrect annotations are removed while clean samples are retained for network training. In addition, an interpretable module is built on the proposed model to help users explore potential marker genes and disease marker genes. Experimental results on multiple benchmark datasets show that the proposed method effectively improves annotation accuracy with a defective reference dataset and successfully identifies marker genes for cell types. (3) To accurately annotate query cells and discover new cell types despite defective reference datasets, a cell type annotation method based on a universal domain adaptation strategy is proposed. The method consists of a shared biological signal extractor and two cell type classifiers that are initialized using different methods. By iteratively minimizing the divergence between the outputs of the two classifiers, the method can identify new cell types in the query dataset and annotate the remaining query cells with their potential cell types. This approach improves cell type annotation accuracy across various single-cell datasets while also mitigating batch effects. Extensive evaluations on different single-cell datasets demonstrate that the proposed method outperforms the baseline method. (4) To integrate multiple reference datasets for cell type annotation, two deep learning frameworks based on multi-source domain adaptation strategies are proposed to improve annotation accuracy. Multiple reference datasets can be seen as
multiple domains, each domain containing common real biological signals and noise. The first method incorporates the query dataset into network training and aligns the embedding distributions of the reference and query datasets to eliminate batch effects, so that the common real biological signals shared across datasets can be extracted for cell type annotation. The second method aligns each reference dataset with the query dataset separately and trains a corresponding classifier for each reference dataset. Finally, the results of these classifiers are integrated to obtain the final cell type annotation. Experimental results show that both methods can improve the accuracy of cell type annotation to some extent. |
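
As an illustration of the final integration step in contribution (4), the sketch below combines the outputs of several per-reference classifiers into one annotation per query cell. The abstract does not say how the results are integrated, so averaging the predicted class probabilities is an assumption made here for illustration only. The function name `integrate_predictions` and the toy probabilities are hypothetical.

```python
def integrate_predictions(per_reference_probs, cell_types):
    """Average class probabilities across classifiers and pick the top type per cell.

    per_reference_probs: one entry per reference-specific classifier; each entry
    is a list of per-cell probability vectors over `cell_types`.
    NOTE: probability averaging is an assumed integration rule, not taken from the source.
    """
    n_classifiers = len(per_reference_probs)
    n_cells = len(per_reference_probs[0])
    n_types = len(cell_types)
    labels = []
    for cell in range(n_cells):
        # Mean probability of each cell type over all classifiers.
        averaged = [
            sum(probs[cell][k] for probs in per_reference_probs) / n_classifiers
            for k in range(n_types)
        ]
        best = max(range(n_types), key=lambda k: averaged[k])
        labels.append(cell_types[best])
    return labels


# Toy example: two reference-specific classifiers, three query cells.
cell_types = ["T cell", "B cell"]
probs_ref_a = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]]
probs_ref_b = [[0.7, 0.3], [0.2, 0.8], [0.1, 0.9]]
labels = integrate_predictions([probs_ref_a, probs_ref_b], cell_types)
print(labels)
```

In the third toy cell the two classifiers disagree, and the averaged probabilities favor the more confident one. Other integration rules, such as majority voting or weighting each classifier by its similarity to the query dataset, would fit the same description.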