| With the continued development of second-generation high-throughput sequencing and single-cell sequencing technologies, many large genome sequencing projects that benefit humanity continue to advance, and large volumes of single-cell sequencing data and genomic variation data have been produced. The focus of research has gradually shifted from data generation and variant discovery to uncovering the deeper mechanisms behind variants and to exploring, starting from genomic variants, the relationship between phenotypes and cell types. This shift is of great significance for understanding the pathogenesis of major diseases and for discovering drug targets. Joint analysis of single-cell sequencing data and phenotypic data is an important way to address this problem. By analyzing how the expression of different genes in tissues and organs influences phenotypic traits, using large amounts of expression quantitative trait loci (eQTL) data and genome-wide association study (GWAS) data, and by combining this with the cell type characteristics in single-cell sequencing data, researchers can study the onset and resolution of diseases and other phenotypes from the perspectives of cell differentiation and immune cell function in different tissues and organs. However, the high noise, severe signal loss, and multi-source heterogeneity of single-cell sequencing data and phenotypic data limit the applicability of existing analysis methods. A data processing and analysis method is therefore needed that targets the intrinsic characteristics of cell type-phenotype data and the establishment of associations between them. This paper focuses on cell type-phenotype association analysis based on single-cell sequencing data. By cleaning the data and improving the accuracy of cell type classification, a novel method of
cell type-phenotype association analysis is proposed; with the addition of a visualization method for cell type-phenotype relationships, a complete analysis workflow for single-cell sequencing data and phenotype data, from data to chart and from cell type to phenotype, is constructed. The main content of this paper includes the following aspects: (1) To address the multi-source heterogeneity, missing data, heavy noise, and low quality of single-cell sequencing data, this paper studies a cleaning method for single-cell sequencing data. The method integrates a set of automated cleaning workflows, including peak feature selection, sparse matrix construction, cell label mapping, and data standardization. It covers the processing of single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) data, cell type label mapping based on homologous single-cell RNA sequencing (scRNA-seq) data, annotation information filtering, and read signal enhancement of scATAC-seq data, thereby avoiding manual errors, reducing noise in single-cell sequencing data, and improving data quality and completeness. (2) To address the severe inherent lack of read signal in scATAC-seq data caused by the limitations of sequencing technology, which hampers cell typing and data analysis, this paper studies an SVM-based cell typing method for scATAC-seq data. The method first tests the performance of an SVM applied directly to scATAC-seq datasets and analyzes its shortcomings through closed-test and open-test experiments. It then scores the correlation between pairs of peaks on the genome based on cis-regulatory relationships and combines this score with the loss calculation of the SVM kernel function to propose an SVM kernel function based on the peak correlation score, which achieves accurate cell typing on scATAC-seq datasets. Model training and prediction experiments on multiple sets of human cell scATAC-seq data containing
cell type annotation labels show that, compared with the current leading methods in the field, this method substantially improves cell classification performance. By effectively capturing the characteristics of scATAC-seq data, it significantly improves the success rate and accuracy of cell typing. (3) To address the important problem of mining associations between specific cell types and phenotypic traits, with the aim of discovering such associations and constructing "cell type-phenotype" association pathways, this paper studies a cell type-phenotype association mining method based on colocalization analysis. The method first applies standardized preprocessing to the phenotypic data, then identifies the genomic window regions suitable for colocalization analysis based on the significant SNP variants in the eQTL data and GWAS data, and then performs colocalization analysis based on the minor allele frequencies of all SNP variants in these windows. After associations between tissues or organs and phenotypic traits have been established through SNP variants and representative colocalized genes have been obtained, the cell-type-specific peak features of scATAC-seq data are integrated to perform cell type-phenotype association analysis, which yields a variety of potential "cell type-phenotype" associations and colocalized genes. Using published literature and public databases, together with comparative analysis experiments, we verified the accuracy and plausibility of more than 10 groups of "cell type-phenotype" associations and confirmed the effectiveness and practical significance of this method. (4) To address efficient annotation and visualization in the analysis of cell and phenotypic trait data, this paper studies annotation and visualization methods for cell type-phenotype associations. According to the key points of positive
correlation, negative correlation, and contrast that need to be emphasized in cell type-phenotype relationships, this method adopts the concepts of two-dimensional circular genome visualization and constructs a circular genome positioning model. Several visualization framework models were built to serve needs such as multi-omics data and associations between genomic regions. Finally, a cell type-tissue/organ-phenotype model was constructed by fusing the features of these frameworks with the inherent characteristics of biological associations. Visualization experiments on the cell type-phenotype associations obtained in this paper and on those reported in published literature show that this method meets the requirements for drawing cell type-phenotype associations, effectively integrates biological pathways with data relationships, and, through the spatiotemporal transformation of linear sequence data, has high practical value for discovering and annotating unknown cell-phenotype relationships. |
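
The window-identification step in contribution (3) can be illustrated with a minimal sketch. It is not the implementation used in this work. The input layout, the function name `colocalization_windows`, the eQTL cutoff, the flank size, and the rule that a window must contain a significant SNP from both datasets are all assumptions made for illustration. Only the GWAS cutoff of 5e-8 follows the conventional genome-wide significance level.

```python
from typing import Dict, List, Tuple

# Hypothetical input layout: {(chromosome, position): p_value}
SnpTable = Dict[Tuple[str, int], float]


def colocalization_windows(
    eqtl: SnpTable,
    gwas: SnpTable,
    eqtl_threshold: float = 1e-5,  # assumed cutoff, not taken from the source
    gwas_threshold: float = 5e-8,  # conventional genome-wide significance level
    flank: int = 100_000,          # assumed half-width of each window, in bp
) -> List[Tuple[str, int, int]]:
    """Return merged (chromosome, start, end) windows that contain at least
    one significant eQTL SNP and at least one significant GWAS SNP."""
    eqtl_hits = {snp for snp, p in eqtl.items() if p < eqtl_threshold}
    gwas_hits = {snp for snp, p in gwas.items() if p < gwas_threshold}

    # Flank every significant SNP; sorting puts overlapping intervals side by side.
    intervals = sorted(
        (chrom, max(0, pos - flank), pos + flank)
        for chrom, pos in eqtl_hits | gwas_hits
    )

    # Merge overlapping intervals on the same chromosome.
    merged: List[List] = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])

    def has_hit(hits, chrom, start, end) -> bool:
        return any(c == chrom and start <= pos <= end for c, pos in hits)

    # Keep only windows supported by both datasets (an assumed selection rule).
    return [
        (chrom, start, end)
        for chrom, start, end in merged
        if has_hit(eqtl_hits, chrom, start, end)
        and has_hit(gwas_hits, chrom, start, end)
    ]


# Toy data, invented purely for illustration.
example_eqtl = {("chr1", 1_000_000): 1e-8, ("chr1", 5_000_000): 1e-9, ("chr2", 300_000): 0.2}
example_gwas = {("chr1", 1_050_000): 1e-10, ("chr2", 300_000): 1e-12}
example_windows = colocalization_windows(example_eqtl, example_gwas)
```

In the toy data, only the region around 1 Mb on `chr1` has significant signal in both tables, so it is the only window kept. A real pipeline would then run the colocalization test on all SNPs inside each window.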