| The cell is the fundamental unit of living organisms and the key to studying human biology and disease, yet understanding the mechanisms that govern the production of different cell types has long been a significant challenge in biology. Now, with the introduction of advanced technologies such as single-cell transcriptome sequencing, biologists can reveal the heterogeneity of different cell types at single-cell resolution. The Human Cell Atlas project, initiated in this context, aims to provide a unique identity for each cell type, a three-dimensional map of how cell types work together to form tissues, knowledge of how all body systems are connected, and insights into how changes in the map underlie health and disease. If we regard the Human Cell Atlas as a dictionary of information on all cell types in the human body, we urgently need a fast, accurate, and reliable method for querying this dictionary. This doctoral dissertation applies machine learning to the complete process of single-cell data analysis, addressing the high dimensionality, high sparsity, and high noise of single-cell data from six aspects: data understanding, data reduction, algorithm design, model optimization, model evaluation, and software deployment. Finally, a generic ensemble classification framework called scASK is proposed for high-precision, high-reliability cross-dataset cell type classification. The main innovative work of this dissertation focuses on the following three aspects:

1. Structural reduction method based on matrix adaptive slicing technology. Information redundancy and information loss in the process of classification and dimensionality reduction are analyzed from the perspective of cognitive science. The analysis indicates that stretching the two-dimensional numerical matrix into three-dimensional space and then slicing it along the Z-axis into a series of two-dimensional binary matrices is an effective data reduction method to
simplify the data and retain the structural information. First, according to the numerical range of the original gene expression matrix, a linear or logarithmic transformation is chosen to compress that range to fit the slicing range. Then the original gene expression matrix is converted into a series of binary slice matrices by thresholding at equally spaced values. Finally, the inter-slice increment indices are calculated and ranked to determine the optimal slice points that best characterize the structural information. The binary slice matrices corresponding to these optimal slice points have the same dimensionality as the original gene expression matrix, but they are numerically much simpler, which facilitates storage and computation, and they spatially retain the expression patterns of the original matrix at specific thresholds, which is the key information to be captured for subsequent classification learning.

2. Differentiated nearest neighbor algorithm based on high-dimensional sparse matrices. The No Free Lunch theorem and Occam’s razor are important guidelines in machine learning for classification model selection and classification algorithm design, respectively. Model simplicity does not equate to inefficiency, and data sparsity does not imply information loss. First, the nearest neighbor algorithm (kNN), which is particularly suitable for multi-label classification of gene expression data, is chosen because it can effectively reduce the risk of underfitting on highly sparse scRNA-seq data and ensure low bias on the training set. Then Pearson’s correlation distance, Jaccard distance, and cosine distance are chosen as the default distance metrics for kNN; all three have the advantage of becoming more effective as the data become sparser. Finally, cross-validation is used to determine the number of neighbors k, the distance weighting, the number of folds, and other empirical parameters. The binary slice matrices
provide differentiated training data. The three distance metrics, which are adapted to sparse matrices, ensure that the trained nearest neighbor classifiers are complementary to a certain degree. At the level of algorithm design, the differentiated nearest neighbor algorithm thus ensures the effectiveness of the subsequent ensemble classification.

3. Meta-classifier ensemble strategy based on index modes switching. The concept of an ensemble in machine learning is derived from the idea of "the Wisdom of Crowds": each participant’s predictions carry their own noise, and combining the predictions of a large number of participants can potentially offset this noise. First, the training accuracy of each candidate classifier is evaluated on the training and validation sets by cross-validation, and its testing accuracy is evaluated on the test set. Because the testing accuracy better reflects the true classification ability of a classifier, it is given more weight in the weighted accuracy. Then, for the three types of classifiers trained on the same slice point, the difference between training accuracy and testing accuracy is normalized to a common scale as an additional measure of generalization ability. Finally, the meta-classifiers with lower prediction bias are selected from the candidate classifiers using the joint metric of weighted accuracy and generalization ability to construct a classifier index matrix, and the meta-classifiers participate in the ensemble under two index matrix switching modes, local optimum and global optimum, to reduce the prediction variance of the final ensemble classifier. This ensemble strategy, called Switching, has achieved remarkable success on real single-cell datasets and enriches the theory of ensemble learning as a novel strategy.

In the context of the Human Cell Atlas, this dissertation returns to the starting point of cognitive science, delves into the
data analysis mechanism, focuses on original algorithmic innovation, and solves in turn three major challenges that plague the cell type classification of scRNA-seq data. For the high dimensionality of single-cell data, a structural reduction method is proposed; for the high sparsity of single-cell data, a differentiated nearest neighbor algorithm is proposed; for the high noise of single-cell data, an index modes switching ensemble strategy is proposed. By integrating these methods, a generic ensemble classification framework called scASK was developed, which achieved the highest classification accuracy in comparison with five baseline algorithms on real single-cell datasets and the best robustness in comparison with three competing algorithms in random data-missing experiments. In particular, based on the command line version of scASK (scASKcmd), this dissertation has converted several linear command line processes, namely data analysis, feature engineering, model tuning, ensemble mode switching, and results visualization, into graphical and interactive processes using the latest App Designer technology from MathWorks. The resulting generic ensemble classification software, scASKapp, can be directly applied to broader classification or diagnosis tasks, such as those involving single-cell methylation data, cancer gene expression data, and biomedical imaging data. |
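
The structural reduction steps described under the first aspect (range compression, equally spaced thresholding, and ranking of inter-slice increments) can be illustrated with a minimal Python sketch. It is not the scASK implementation. The function names, the `use_log` switch, the use of `log1p`, and in particular the definition of the increment index as the number of entries that differ between consecutive slices are all assumptions made for illustration, since the abstract does not define them.

```python
import math


def compress(matrix, use_log=True):
    """Compress the value range into [0, 1] (assumed transform: optional log1p, then max-scaling)."""
    values = [[math.log1p(v) if use_log else float(v) for v in row] for row in matrix]
    top = max(max(row) for row in values)
    if top == 0:
        return values
    return [[v / top for v in row] for row in values]


def binary_slices(scaled, n_slices):
    """Threshold the scaled matrix at n_slices equally spaced points inside (0, 1)."""
    thresholds = [(i + 1) / (n_slices + 1) for i in range(n_slices)]
    slices = [[[1 if v >= t else 0 for v in row] for row in scaled] for t in thresholds]
    return thresholds, slices


def increments(slices):
    """Hypothetical increment index: count of entries that differ between consecutive slices."""
    result = []
    for lower, upper in zip(slices, slices[1:]):
        changed = sum(
            a != b
            for row_lower, row_upper in zip(lower, upper)
            for a, b in zip(row_lower, row_upper)
        )
        result.append(changed)
    return result


def top_slice_points(matrix, n_slices=4, n_keep=2, use_log=True):
    """Return (threshold, binary slice) pairs for the slices with the largest increments."""
    scaled = compress(matrix, use_log)
    thresholds, slices = binary_slices(scaled, n_slices)
    inc = increments(slices)
    # inc[i] describes the change from slice i to slice i + 1 and is credited to slice i + 1.
    ranked = sorted(range(len(inc)), key=lambda i: inc[i], reverse=True)
    keep = sorted(i + 1 for i in ranked[:n_keep])
    return [(thresholds[i], slices[i]) for i in keep]


# Toy expression matrix: 2 cells x 3 genes (made-up values for illustration only).
expression = [[0, 1, 10], [100, 0, 1000]]
selected = top_slice_points(expression, n_slices=4, n_keep=2)
for threshold, binary in selected:
    print(round(threshold, 2), binary)
```

Each selected slice keeps the shape of the input matrix but contains only 0 and 1, which matches the property the abstract emphasizes: dimensionality is unchanged while the values are greatly simplified. Whether a large or a small increment marks a better slice point is a design choice that this sketch simply assumes.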