Biomarker Discovery Methods Based On Connected Network Regularized Feature Selection

Posted on: 2024-01-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Y Li
Full Text: PDF
GTID: 1520306923957489
Subject: Biomedical engineering
Abstract/Summary:
Research has indicated that the phenotypes of complex diseases result from gene-gene interactions, gene-environment interactions, and synergistic effects within cells. Identifying biomarkers that carry network topological information from gene-gene interaction networks is crucial for the diagnosis, treatment, and prognosis of complex diseases such as cancer. However, identifying biomarkers from biomedical data is an NP-hard problem, and most existing data-driven machine learning methods yield isolated variables or features. This thesis develops a feature selection method based on biomolecular networks, using transcriptomic data and gene interaction networks to systematically model the discovery of connected network biomarkers.

The problem of discovering biomarkers from omics data can be transformed into a feature selection problem in machine learning. Most existing methods do not consider the tendency of genes with similar phenotypic associations to co-localize in specific areas of gene interaction networks, so they identify isolated features as biomarkers. To address this, we developed a novel embedded connected network regularization feature selection method. It takes the gene interaction network structure among variables into account and introduces network connectivity constraints from graph theory into sparse regularization models. The method embeds sparse feature selection and prior network topological information into the training process of the machine learning model and automatically selects important connected features from high-dimensional variables. These features can serve as diagnostic or prognostic biomarkers, giving a systematic and interpretable machine learning and feature selection framework for biomarker discovery.

The main research work and innovation points of this thesis are as follows:

(1) Because of the importance of biomarkers, there has been much research on biomarker discovery methods. However, methods that consider the network structure of feature correlations are relatively scarce, and none of them considers the connectivity of the identified network features. In this thesis, based on the algebraic connectivity and geometric connectivity of graphs, we first introduce the connectivity inequality that characterizes network structures as a connectivity constraint for optimization problems, and propose the Connected Network (CNet) regularized feature selection method. This method is the basis and theoretical support for the main research work of this thesis. Instead of viewing biomarkers as independent and isolated individuals, it views them as a connected component in a complex network of interdependent and interrelated biological molecules. On this basis, we discuss three typical sparse statistical learning models from the perspective of biomedical problems and give their corresponding optimization problems and solution methods. In particular, by comparing CNet regularization methods with general regularization methods, we find that CNet provides a deeper understanding of disease-driving factors and biological mechanisms. In addition, we explore the interior-point algorithm for solving inequality-constrained optimization problems and propose evaluation metrics for comparing the performance of different feature selection methods.

(2) To identify connected network diagnostic biomarkers, we establish a Connected Network Regularized Logistic Regression (CNet-RLR) model based on maximizing classification probability and apply it to identify diagnostic biomarkers for Uterine Corpus Endometrial Carcinoma (UCEC). We regard the "cancer" and "normal" phenotypes in medicine as labels for a binary classification problem in machine learning, and propose the CNet-RLR model based on a gene interaction network that reflects a large number of interactions between genes. We also design an interior-point algorithm for solving the optimization problem. The CNet-RLR model is based on the minimum node cut set of the longest diameter path in the gene interaction network and employs the connectivity inequality to maintain the inherent topological structure of the network, embedding it in the estimation of the regression coefficients in the form of connectivity constraints. The results show that the connected network constraints enable CNet-RLR to identify a connected network component as biomarkers, outperforming other regularized logistic regression models in terms of structure fidelity and classification performance. This method provides a reference for discovering feature variables with connected network structures using logistic regression models.

(3) To identify connected network diagnostic biomarkers, we establish a Connected Network-constrained Support Vector Machine (CNet-SVM) model based on maximizing the geometric margin and apply it to identify diagnostic biomarkers for Breast Cancer (BRCA). The CNet regularization method is integrated into the standard SVM model to establish the CNet-SVM model, which maximizes the geometric margin between classes while imposing connectivity constraints between genes, so that the selected feature genes form a connected network component. CNet-SVM is a structural risk minimization model that uses the maximum geometric margin surface as the optimal classification surface when performing feature selection and classification tasks. By selecting an appropriate threshold to truncate the absolute values of the hyperplane decision parameters determined by CNet-SVM, a connected component of the original network can be obtained. Results on simulated and real-world data show that CNet-SVM classifies more accurately than other regularized SVM models. The identified connected network topology among the network biomarkers deepens our understanding of how genes synergistically execute functions in cancer mechanisms in the form of pathways.

(4) To identify connected network prognostic biomarkers, we establish a Connected Network Regularized Cox Proportional Hazards (CNet-RCPH) model that considers patient survival time after surgery and apply it to identify prognostic biomarkers for BRCA. The CNet regularization term is embedded into the Cox proportional hazards model to obtain the CNet-RCPH model, which studies the prognostic feature variables reflecting breast cancer recurrence or mortality risk. CNet-RCPH is a class of optimization problems with inequality constraints given a risk set. It maps gene expression values one-to-one onto a gene interaction network, uses gene expression profiles with a network topology as model input, and then identifies prognostic feature variables. The results show that CNet-RCPH effectively integrates prior prognosis-related genes and identifies their potential network structures, extending the identification from isolated feature variables to connected network feature variables. Specifically, the prognostic risk score calculated by CNet-RCPH enables effective prediction of cancer patient survival status, contributing to improved early diagnosis and prognosis assessment for breast cancer patients.

Based on the interpretable research framework of "omics data-network integration-feature selection-method application", this thesis establishes mathematical models and devises solution algorithms to propose effective identification methods for diagnostic and prognostic biomarkers. These contributions will advance research on bioinformatics models and methods, as well as big data analysis of complex diseases and the development of precision medicine.
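
As an illustration of the thresholding step mentioned in point (3) and the algebraic connectivity mentioned in point (1), the following minimal Python sketch truncates a coefficient vector and checks whether the selected genes induce a connected subgraph. It is not the thesis's CNet optimization itself: the function names, the toy 5-gene path network, the coefficient values, and the threshold are all assumptions made for illustration. The check relies on the standard fact that a graph is connected exactly when the second-smallest eigenvalue of its Laplacian is positive.

```python
import numpy as np


def algebraic_connectivity(adj):
    """Second-smallest eigenvalue of the graph Laplacian L = D - A."""
    adj = np.asarray(adj, dtype=float)
    if adj.shape[0] < 2:
        return 0.0
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigenvalues = np.linalg.eigvalsh(laplacian)  # sorted in ascending order
    return float(eigenvalues[1])


def selected_subgraph(adj, coef, threshold):
    """Keep features with |coef| > threshold and return the induced subgraph."""
    keep = np.flatnonzero(np.abs(coef) > threshold)
    return keep, np.asarray(adj)[np.ix_(keep, keep)]


def is_connected(adj, tol=1e-9):
    """A graph with n >= 2 nodes is connected iff its algebraic connectivity > 0."""
    n = np.asarray(adj).shape[0]
    if n == 0:
        return False
    if n == 1:
        return True
    return algebraic_connectivity(adj) > tol


# Hypothetical toy network: 5 genes on a path 0-1-2-3-4.
path_adj = np.zeros((5, 5))
for i in range(4):
    path_adj[i, i + 1] = path_adj[i + 1, i] = 1.0

# Hypothetical coefficient vectors, e.g. from two different fitted models.
coef_connected = np.array([0.9, 0.5, 0.3, 0.0, 0.0])
coef_split = np.array([0.9, 0.5, 0.0, 0.7, 0.6])

keep_connected, sub_connected = selected_subgraph(path_adj, coef_connected, 0.1)
keep_split, sub_split = selected_subgraph(path_adj, coef_split, 0.1)

print(keep_connected.tolist(), is_connected(sub_connected))
print(keep_split.tolist(), is_connected(sub_split))
```

In this toy case the first coefficient vector selects three adjacent genes, while the second drops the middle gene and leaves two separate pairs. The second outcome is the kind of isolated selection that a connectivity constraint is meant to rule out.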
Keywords/Search Tags: Biomarker discovery, Feature selection, Network regularization, Gene interaction network, Connectivity constraints, Bioinformatics