Research On Methods Of Analyzing Biological Data Based On Feature Synergy

Posted on: 2021-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: J L Li
Full Text: PDF
GTID: 2370330626460374
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rapid development of computer technology, bioinformatics has made great progress. How to extract useful information from biological data is a problem that bioinformatics researchers need to solve. Studies have shown that examining genes, proteins, metabolites, and other features of organisms from a synergy perspective contributes to an in-depth understanding of biological mechanisms.

First, this thesis proposes RF-FC, a random forest algorithm based on feature combination. Because combinations of features can reflect more macroscopic and systematic changes in an organism, each base decision tree, as it grows, evaluates not only the distinguishing ability of every single feature but also, using a linear support vector machine, that of every feature pair and every feature triple. The candidate that splits the node best is chosen. Experiments on 14 public data sets show that the classification performance of RF-FC is superior to that of random forest in most cases.

Second, this thesis proposes LC-k-TSP-PlattCE, an improved LC-k-TSP algorithm based on Platt scaling and the scores of feature pairs. In the decision-making stage of LC-k-TSP, the classifiers' confidences on the unknown sample are calculated with the Platt scaling algorithm and then weighted by the first scores. LC-k-TSP-PlattCE retains the advantages of LC-k-TSP: it uses the linear relations of k>0 feature pairs to construct an ensemble classifier, its classification criterion is simple, and it lends itself to biological interpretation. Experiments on 11 public data sets show that LC-k-TSP-PlattCE performs better than LC-k-TSP and support vector machine in most cases.

Third, based on multiple combination relationships among features, this thesis proposes MCR-Net, an algorithm for biological network construction and module biomarker discovery. In this algorithm, features are taken as the nodes of the network, and every two nodes are examined under four combination forms: '+', '-', '×' and '÷'. For each combination form, a one-way analysis of variance is conducted, and the best form is selected to measure the synergy of the corresponding two nodes. The p value of the best combination form of the two nodes is used as the edge's weight. As a result, a biological network that can reflect the physiological and pathological changes of organisms is constructed. Important network modules are then found with a greedy strategy. Based on the modules' distinguishing abilities and the prediction confidences of the support vector machines constructed on them, the information of multiple modules is integrated to perform the classification. Experiments on 18 public data sets show that the proposed MCR-Net algorithm performs better than other popular biomarker selection algorithms in most cases.

The three analytical algorithms proposed in this thesis are all based on the synergy among biological features, and they have strong application value in biomarker research and prediction. A comparison among them shows that the classification model based on MCR-Net performs best in most cases, while LC-k-TSP and RF-FC are more interpretable than MCR-Net.
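The pairwise scoring step of MCR-Net described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`anova_f`, `best_form`, `build_edges`) and the data layout are hypothetical. The thesis uses the ANOVA p value as the edge weight; to stay within the Python standard library, this sketch ranks the four combination forms by the one-way ANOVA F statistic instead. For a fixed set of samples and classes, a larger F corresponds to a smaller p value, so the selected form is the same, but the p value itself would have to be obtained from the F distribution (for example with `scipy.stats.f_oneway`). The sketch also assumes at least two classes and no zero values in the divisor feature.

```python
from itertools import combinations
from operator import add, sub, mul, truediv

# The four combination forms named in the abstract: +, -, x and division.
FORMS = {"+": add, "-": sub, "*": mul, "/": truediv}


def anova_f(values, labels):
    """One-way ANOVA F statistic of `values` grouped by class `labels`.

    Assumes at least two distinct classes.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    if ss_within == 0:
        # No within-class spread: perfect separation, or no signal at all.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def best_form(x, y, labels):
    """Return (symbol, F) for the combination of x and y with the largest F."""
    scores = {symbol: anova_f([op(a, b) for a, b in zip(x, y)], labels)
              for symbol, op in FORMS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def build_edges(features, labels):
    """Score every feature pair; `features` maps a name to per-sample values."""
    return {(i, j): best_form(features[i], features[j], labels)
            for i, j in combinations(sorted(features), 2)}
```

Calling `build_edges({"a": [...], "b": [...]}, labels)` returns one entry per feature pair, holding the winning combination form and its score. Module discovery and SVM-based integration are not covered by this sketch.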
Keywords/Search Tags: Bioinformatics, Biomarkers, Synergy, Feature Selection, Classification