Research And Application Of Machine Learning Methods For Single Cell Sequence Data

Posted on: 2023-07-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Zhao
GTID: 1524307025964369
Subject: Computer Science and Technology
Abstract/Summary:
Single-cell sequencing is an emerging technology for sequencing and quantifying genetic information at the level of a single cell. It can be used to analyze cell types, states, interaction mechanisms and evolution, and has become an important approach in current disease research. Machine learning methods, including cluster analysis, classification and multi-omics data integration, are increasingly becoming the main methods for single-cell sequencing data analysis. Clustering analysis can effectively partition cells into types. By learning from existing cell type labels, a cell classification model can be constructed to recognize the same cell types quickly and accurately. Integrated analysis of single-cell multi-omics data can combine information about cells from multiple angles, systematically reveal cell types, functions and interactions, reduce false positive rates, and accurately present the overall picture of cell heterogeneity.

However, new methods are still needed. Many clustering methods are prone to falling into local optima, which limits the accuracy of clustering results and of subsequent cell type recognition. Existing cell recognition methods do not consider sample weights, so their accuracy needs further improvement. There are also still deficiencies in data integration and analysis. Based on this situation, and from the perspective of clustering, classification and multi-omics data integration, this dissertation develops corresponding algorithms based on simulated and real data, aiming to improve the reliability and stability of single-cell sequencing analysis and to facilitate the study of complex diseases.

(1) Due to the large number of cells and the heterogeneity in multi-dimensional attributes, single-cell sequencing data are characterized by high dimensionality and high noise. Existing clustering methods are very sensitive to noisy data and outliers, and they easily fall into locally optimal solutions, which greatly limits the accuracy of clustering. This dissertation is devoted to improving the clustering method. The single-cell self-paced clustering (scSPaC) method based on the Frobenius norm and the sparse single-cell self-paced clustering (sscSPaC) method based on the l2,1-norm are introduced for scRNA-seq data clustering analysis. Cells are added to the clustering model gradually, from easy to complex, which reduces the influence of noise and outliers on the clustering results and keeps the algorithm from falling into a local optimum. The performance of the improved scSPaC clustering algorithm was evaluated on simulated data and real scRNA-seq data. The results showed that it performed significantly better than current clustering algorithms and could effectively improve the accuracy of the clustering results and of subsequent cell type recognition.

(2) Identifying cell types accurately can help explain their function and how they relate to disease. Because clustering analysis of single-cell sequencing data is extremely complex, and some cells have unknown type labels and cannot be accurately classified, effective classification models are urgently needed to guide cell classification accurately and quickly. This dissertation focuses on single-cell sequencing data classification and proposes a single-cell robust softmax regression (scRoSR) model, derived from the softmax multiclass classification model, to guide the identification of specific cell types. Specifically, scRoSR uses a weighting scheme that assesses the importance of each individual cell, and each cell contributes to the classification problem according to its weight. In this way, the impact of noisy data and outliers (which usually receive low weights) can be greatly reduced. However, standard self-paced learning is affected by class imbalance: if some cell types are insensitive to the loss, they have little effect during training. To alleviate this problem, two new soft weighting schemes were designed to assign a weight to each cell type and to select cells with a self-paced strategy within each class. The performance of the scRoSR classification algorithm was evaluated on simulated data and real single-cell sequencing data. The results show that scRoSR has stable single-cell type recognition performance and that its classification performance is significantly better than that of other classification algorithms, so it can be used for accurate and rapid recognition of specific cell types.

(3) Single-cell sequencing has been applied to a variety of omics studies, such as transcriptomics (scRNA-seq) and epigenomics (scATAC-seq). Integrative analysis of different omics data can help to comprehensively characterize the molecular basis of cells and their functions. Most existing methods for integrating multi-omics data are based on Euclidean distance and achieve joint analysis by sharing one of the factor matrices, which largely ignores the heterogeneous relationships among different omics data. Graph-oriented clustering methods are widely used in multi-view clustering analysis because they can effectively learn the heterogeneous relationships and complex structures hidden in the data. This dissertation focuses on single-cell multi-omics data integration and proposes an implicit adaptive manifold single-cell multi-omics integration algorithm (AML) using a multi-view cluster analysis method. First, different omics data are treated as different views, which transforms multi-omics data integration into a multi-view clustering analysis that integrates multiple adaptive graphs into a consensus graph with a manifold topology. Second, the consensus graph is controlled by an effective rank constraint so that its connected components correspond precisely to different clusters. As a result, AML obtains discrete clustering results directly, without any post-processing. Finally, the performance of the AML algorithm is evaluated on simulated and real data, and the results show that AML significantly outperforms other multi-omics data integration algorithms in characterizing the molecular basis of cells and their functions.

In conclusion, this dissertation focuses on three key technologies in single-cell sequencing analysis (clustering, classification and multi-omics data integration) and designs corresponding models and algorithms to enrich single-cell sequencing analysis technology. The research approach of this dissertation is of theoretical and practical value for revealing intercellular heterogeneity, discovering new cell subpopulations, resolving cell lineage differentiation, discovering new disease markers, and supporting personalized precision medicine.
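The easy-to-complex admission of cells described in (1) can be illustrated with a small sketch. This is not scSPaC itself: the abstract does not give its objective, so the sketch swaps in plain k-means with hard 0/1 self-paced weights, and the function name `self_paced_kmeans`, the linear pace schedule and the parameters `start_frac` and `end_frac` are all assumptions made for illustration.

```python
import numpy as np

def self_paced_kmeans(X, init_centers, n_rounds=10, start_frac=0.5, end_frac=0.9):
    """Toy self-paced k-means (hypothetical; not the dissertation's scSPaC).

    Each round, only the cells with the smallest loss are admitted to the
    centroid update, and the admitted fraction grows from start_frac to end_frac.
    """
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)  # copy of the caller's initial centroids
    n, k = X.shape[0], centers.shape[0]
    for r in range(n_rounds):
        # Per-cell loss: squared distance to the nearest centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        loss = d2[np.arange(n), labels]
        # Pace schedule (assumed linear): admit a growing fraction of cells.
        frac = start_frac + (end_frac - start_frac) * r / max(n_rounds - 1, 1)
        lam = np.quantile(loss, frac)
        v = (loss <= lam).astype(float)  # hard self-paced weights in {0, 1}
        # Update each centroid from its admitted cells only.
        for j in range(k):
            w = v * (labels == j)
            if w.sum() > 0:
                centers[j] = (w[:, None] * X).sum(axis=0) / w.sum()
    # Final assignment of every cell, admitted or not, to its nearest centroid.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centers, v

# Two tight groups of four cells plus one outlier at (30, 30).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [30, 30]], dtype=float)
labels, centers, v = self_paced_kmeans(X, init_centers=[[0, 0], [10, 10]])
```

In this toy data the outlier is still assigned to the second cluster, but it is never admitted to the centroid update, so the second centroid stays at the mean of its four inliers instead of being pulled to (14.4, 14.4), which is the mean of the group with the outlier included.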
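The class-imbalance issue raised in (2) can be made concrete with a sketch of per-class self-paced selection on top of softmax cross-entropy losses. The abstract states that scRoSR uses two soft weighting schemes but does not define them, so the sketch below uses a simpler hard 0/1 rule as a stand-in; the function names and the `keep_frac` parameter are hypothetical.

```python
import numpy as np

def softmax_losses(logits, y):
    """Per-cell cross-entropy loss of a softmax classifier."""
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y]

def classwise_pace_weights(losses, y, keep_frac):
    """Hard 0/1 weights that keep the lowest-loss fraction within each class.

    Assumed simplification: the dissertation describes soft weights instead.
    """
    w = np.zeros_like(losses)
    for c in np.unique(y):
        idx = np.nonzero(y == c)[0]
        n_keep = max(1, int(np.ceil(keep_frac * len(idx))))
        easiest = idx[np.argsort(losses[idx])[:n_keep]]
        w[easiest] = 1.0
    return w

def weighted_loss(losses, w):
    """Weighted mean loss over the selected cells."""
    return float((w * losses).sum() / w.sum())

# Class 1 has uniformly higher losses than class 0 (made-up numbers).
losses = np.array([0.1, 0.9, 0.2, 5.0, 4.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
w = classwise_pace_weights(losses, y, keep_frac=0.5)
```

With a single global threshold that admits the four easiest cells, class 1 would contribute only one cell; selecting within each class keeps two cells from each, which is the kind of balance the class-wise strategy is meant to restore.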
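The link in (3) between a rank constraint and discrete clusters rests on a standard property of graph Laplacians: the number of zero eigenvalues of L = D - W equals the number of connected components of the graph, so requiring rank(L) = n - c yields exactly c components. The sketch below only checks this property on a toy graph and does not implement the AML optimization; the two views, the plain averaging used to form the consensus graph, and all function names are assumptions for illustration.

```python
import numpy as np

def affinity(n, edges):
    """Symmetric 0/1 affinity matrix from an undirected edge list."""
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0
    return W

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W

def count_zero_eigenvalues(L, tol=1e-8):
    """Number of (numerically) zero eigenvalues of a symmetric matrix."""
    return int((np.abs(np.linalg.eigvalsh(L)) < tol).sum())

def component_labels(W):
    """Label connected components by depth-first search over nonzero edges."""
    n = W.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for s in range(n):
        if labels[s] >= 0:
            continue
        labels[s] = current
        stack = [s]
        while stack:
            u = stack.pop()
            for v in np.nonzero(W[u])[0]:
                if labels[v] < 0:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

# Two hypothetical views (e.g. two omics layers) over the same six cells.
view_a = affinity(6, [(0, 1), (1, 2), (3, 4)])
view_b = affinity(6, [(0, 2), (3, 5), (4, 5)])
consensus = (view_a + view_b) / 2  # plain average, an assumed stand-in for adaptive fusion

n_clusters = count_zero_eigenvalues(laplacian(consensus))
cluster_of_cell = component_labels(consensus)
```

Here each view alone splits the cells into three components, while the consensus graph has exactly two, and the cluster labels are read directly from its connected components with no k-means step afterwards.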
Keywords/Search Tags: single cell sequencing data, machine learning, cluster analysis, Softmax regression, multi-omics data integration