
New Chemometric Algorithms In Bioinformatic Studies And Multi-Dimensional Metabolite Data Analysis

Posted on: 2018-10-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Yan
GTID: 1310330542456620
Subject: Chemistry
Abstract/Summary:
The research work in this thesis focuses on new chemometric algorithms for bioinformatics and multi-dimensional metabolite data analysis.

Most proteins localize to more than one organelle in a cell. Unmixing the localization patterns of proteins is critical for understanding protein functions and other vital cellular processes. Herein, a nonlinear machine learning technique is proposed for the first time for protein pattern unmixing. Variable-weighted support vector machine (VW-SVM) has been demonstrated to be a robust modeling technique with flexible and rational variable selection. Optimizing it with a global stochastic optimization technique, the particle swarm optimization (PSO) algorithm, makes VW-SVM an adaptive, parameter-free method for automated unmixing of protein subcellular patterns. Results obtained by pattern unmixing of a set of fluorescence microscope images of cells indicate that VW-SVM optimized by PSO is able to extract useful pattern features by optimally rescaling each variable for non-linear SVM modeling, consequently leading to improved performance in multiplex protein pattern unmixing compared with conventional SVM and other existing pattern unmixing methods.

Modern biological imaging techniques can reveal complex subcellular distributions across different organelles for multiplex proteins. Quantifying the fraction of a protein in each cellular compartment is important for gaining insight into various protein functions and cell mechanisms. However, the imaging quality is affected by the specific cell type, resulting in loss of significant information related to protein subcellular location patterns. To improve the pattern-distinguishing ability, we propose a novel concept of texture descriptors that introduces the spatial structures of micropatterns of interest for variable-weighting modeling of multiplex protein patterns. Aiming at an automatic modeling strategy, the PSO algorithm is also used to optimize the
variable weights and other parameters in the models. Such a parameter-free computational system, named TexVW-MPUnmixing, has been applied to multiplex pattern unmixing of proteins by modeling a cell fluorescence microscope image set, coupled separately with linear partial least squares (PLS) and non-linear support vector machine (SVM). The results demonstrate that the proposed TexVW-MPUnmixing greatly improves protein pattern unmixing precision because of the introduction of spatial structure descriptors. It thus holds great potential for efficient automatic unmixing of multiplex protein patterns.

Aptamers have exhibited great potential for research, clinical and industrial purposes. A critical step in realizing these applications is to obtain high-affinity aptamers specific to targets of interest. To facilitate the selection of aptamers generated in the systematic evolution of ligands by exponential enrichment (SELEX) process, we propose a novel nucleic acid sequence encoding strategy, Apta-LoopEnc, for secondary structural feature extraction of candidate sequences by analysing their delicate substructures in loop regions. Since the unique loop structures of aptamers determine their interaction with targets, encoding their central loop structures directly captures properties related to aptamer binding affinity. Additionally, the nucleotide composition of a sequence is also used as a descriptor in Apta-LoopEnc to further decrease the description similarity between sequences. The feasibility of Apta-LoopEnc for sequence encoding has been demonstrated by a study of high-affinity aptamer identification against human hepatocellular carcinoma cells. The results indicate that Apta-LoopEnc significantly improves the performance of different pattern recognition models. Using the Apta-LoopEnc based support vector machine (SVM) to predict a set of newly designed candidate sequences beyond SELEX has further demonstrated the potential of the developed
sequence encoding and prediction strategy to aid high-performance aptamer design and optimization in an easy, time-saving and cost-effective way via computation, thus promoting the development of aptamer-related studies and applications.

GC-MS urinary metabolomic analysis coupled with chemometrics is used to detect inborn errors of metabolism (IEMs), which are genetic disorders causing severe mental and physical debility and even sudden infant death. Orthogonal partial least squares discriminant analysis (OPLS-DA) is an efficient multivariate statistical method for analysing metabolite profiling data. However, performance degradation is often observed for OPLS-DA as the size and complexity of metabolomic datasets increase. In this study, hybrid particle swarm optimization (HPSO) is employed to modify OPLS-DA by simultaneously selecting the optimal variable subset, the associated weights and the appropriate number of orthogonal components, yielding a new algorithm called HPSO-OPLSDA. In an investigation of two IEMs, methylmalonic acidemia (MMA) and isovaleric acidemia (IVA), the results suggest that HPSO-OPLSDA can significantly outperform OPLS-DA in discriminating between disease samples and healthy controls. Moreover, the main discriminative metabolites are identified by HPSO-OPLSDA to aid the clinical diagnosis of IEMs, including methylmalonic-2, methylcitric-4(1) and 3-OH-propionic-2 for MMA and isovalerylglycine-1 for IVA.

The complexity of metabolic profiles makes chemometric tools indispensable for extracting the most significant information. Orthogonal partial least squares discriminant analysis (OPLS-DA) is one of the most effective strategies for data analysis in metabonomics. However, its actual efficacy in metabonomics is often weakened by an excess of variables and a shortage of samples. To rectify this situation, hybrid particle swarm optimization (HPSO) is introduced to improve OPLS-DA by simultaneously selecting the appropriate sample weights, the optimal variable subsets, and
the best number of orthogonal components (SWVSO) in OPLS-DA, forming a new algorithm named HPSO-SWVSO-OPLSDA. Combined with gas chromatography-mass spectrometry based metabonomics, HPSO-SWVSO-OPLSDA is applied to distinguish patients with IEMs from healthy controls. Compared with conventional OPLS-DA, HPSO-SWVSO-OPLSDA not only significantly improves the recognition rate but also identifies several of the most discriminative metabolites to aid the diagnosis of methylmalonic acidemia (MMA) and isovaleric acidemia (IVA).
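
As a rough illustration of the variable-weighting idea that runs through the abstract, the sketch below uses a plain global-best PSO to search for per-variable weights that rescale each variable before classification. It is not the thesis's VW-SVM or HPSO-OPLSDA: a leave-one-out 1-nearest-neighbour classifier stands in for the SVM to keep the example dependency-free, and the toy data, the function names (`pso_minimize`, `make_toy_data`, `loo_error`) and all PSO settings are assumptions made for illustration only.

```python
import random


def pso_minimize(objective, dim, n_particles=20, n_iter=40, lo=0.0, hi=1.0,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Plain global-best PSO over the box [lo, hi]^dim (illustrative settings)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    # Assumption: seed one particle with equal weights so the result can
    # never be worse than the unweighted baseline.
    pos[0] = [hi] * dim
    vel = [[0.0] * dim for _ in range(n_particles)]
    p_best = [p[:] for p in pos]
    p_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: p_val[i])
    g_best, g_val = p_best[g][:], p_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (p_best[i][d] - pos[i][d])
                             + c2 * r2 * (g_best[d] - pos[i][d]))
                # Clamp so every weight stays inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < p_val[i]:
                p_best[i], p_val[i] = pos[i][:], val
                if val < g_val:
                    g_best, g_val = pos[i][:], val
    return g_best, g_val


def make_toy_data(n_per_class=15, seed=1):
    """Hypothetical data: variable 0 separates the classes, variable 1 is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((0, 0.0), (1, 2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 0.3), rng.uniform(-5.0, 5.0)])
            y.append(label)
    return X, y


def loo_error(weights, X, y):
    """Leave-one-out error of 1-NN after rescaling each variable by its weight."""
    errors = 0
    for i, xi in enumerate(X):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: sum((w * (a - b)) ** 2
                                        for w, a, b in zip(weights, xi, X[j])))
        errors += y[nearest] != y[i]
    return errors / len(X)


X, y = make_toy_data()
baseline_err = loo_error([1.0, 1.0], X, y)  # equal weights, no rescaling
weights, best_err = pso_minimize(lambda w: loo_error(w, X, y), dim=2)
print("equal-weight error:", baseline_err)
print("PSO weights:", weights, "error:", best_err)
```

In this toy setting a useful solution shrinks the weight of the noise variable relative to the informative one. Replacing `loo_error` with a cross-validated error of an SVM or OPLS-DA model would move the sketch closer to the setting the abstract describes, at the cost of external dependencies.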
Keywords/Search Tags:Bioinformatics, Metabonomics, Particle swarm optimization algorithm, Support vector machines, Orthogonal partial least squares discriminant analysis, Variable selection, Variable weighting, Sample weighting