Research On Biclustering Algorithm Based On Gene Expression Analysis

Posted on: 2022-06-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Y Liu
GTID: 1480306311966519
Subject: Operational Research and Cybernetics
Abstract/Summary:
With the development of high-throughput sequencing technology, large-scale gene expression data are accumulating at an ever faster rate. Analyzing these massive expression data with mathematical methods is a great challenge. One critical piece of information in expression data is the correlation of gene expression, which can be used for candidate disease gene prioritization, functional gene annotation, and the identification of regulatory genes. This is extremely important for discovering cancer subtypes, predicting disease-causing genes, and drug screening. Unlike in general data, however, genes involved in the same regulatory mechanism have correlated expression levels only under certain specific conditions (such as in cells belonging to the same tissue) and not under others. That is, only some genes are correlated under some conditions. Traditional cluster analysis methods cannot effectively identify this local correlation. As an improvement on traditional clustering, biclustering is widely used to identify such local correlations among genes in expression data. However, since the biclustering problem is at least NP-hard, current algorithms have their own limitations: neither the types of biclusters they can recognize nor their accuracy is ideal.

Based on the analysis of gene expression data, this dissertation proposes two novel biclustering algorithms that address the limitations of previous algorithms, namely the limited types of biclusters recognized, the poor quality of the biclusters, and high time complexity. The main work of this research is as follows.

Firstly, a new biclustering algorithm, RecBic, based on a column-based seed expansion strategy is proposed. By identifying the most biologically significant trend-preserving biclusters, the method achieves the goal of identifying multiple types of biclusters. The new seed selection strategy ensures the high accuracy of the algorithm. Meanwhile, recognizing that the number of rows is far greater than the number of columns in most gene expression data, the algorithm is designed to have linear time complexity with respect to the number of rows, which greatly reduces its running time. We tested RecBic and eight other biclustering algorithms on both simulated and real data. RecBic's relevance and recovery scores on the different types of data sets are clearly better than those of the other algorithms. On real data, the average enrichment rate of RecBic is about 12% higher than that of the second-ranked algorithm.

Secondly, we developed a new algorithm called BicGO. The algorithm defines a brand-new directed acyclic graph model for the expression matrix and converts the search for trend-preserving biclusters in expression data into the search for the longest path in a series of directed acyclic graphs. This model fits the definition of trend-preserving biclusters better than the longest common subsequence model of UniBic, which we developed previously, and solves UniBic's inability to find complex trend-preserving biclusters. BicGO uses a row-based seeding strategy, which has lower time complexity with respect to the number of columns than RecBic and is suitable for gene expression data with many columns. We also propose a new objective function in BicGO to increase the true positive rate of co-expressed genes. We compared BicGO with seven other algorithms on simulated and real data. BicGO was significantly better than the other algorithms on simulated data, and on expression data from five groups of different species it achieved the best F1 score, 29% higher than that of the second-best algorithm, where the F1 score is defined as the harmonic mean of the bicluster enrichment score and the gene enrichment score.

Although together these two algorithms perform well on expression data, they still have shortcomings. First, with the development of single-cell sequencing, gene expression data sets now reach tens of thousands of rows (genes) and hundreds of thousands of columns (cells). Neither our algorithms nor many other biclustering algorithms can cope with gene expression data at such a scale. Second, single-cell expression data contain many missing values, and our biclustering algorithms do not have a good solution for data with many missing values. Further work is needed to apply biclustering algorithms effectively to single-cell gene expression data.

RecBic is open-source software and can be downloaded from the RecBic website: https://github.com/holvzews/RecBic/tree/master/RecBic/
BicGO will be open-sourced soon; it can currently be obtained by contacting the author of RecBic on GitHub.
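
To make the "longest path in a directed acyclic graph" idea from the abstract concrete, here is a minimal Python sketch. It is not BicGO's actual graph model, which the abstract does not specify. It assumes a simplified reading of "trend-preserving": a set of rows shares a trend on a column sequence if every row strictly increases along that sequence. The function name `longest_shared_trend` and the toy matrix are hypothetical.

```python
def longest_shared_trend(rows):
    """Return the longest column sequence along which every row strictly increases.

    rows: list of equal-length lists of numbers (one list per gene).
    Illustrative simplification only, not the BicGO model.
    """
    n_cols = len(rows[0])
    # Edge i -> j exists when every row is smaller in column i than in column j.
    succ = {
        i: [j for j in range(n_cols) if j != i and all(r[i] < r[j] for r in rows)]
        for i in range(n_cols)
    }
    # Every edge i -> j implies rows[0][i] < rows[0][j], so sorting columns by
    # the first row's values yields a valid topological order of this DAG.
    order = sorted(range(n_cols), key=lambda c: rows[0][c])
    best = {c: [c] for c in range(n_cols)}  # longest path found so far ending at c
    for i in order:
        for j in succ[i]:
            if len(best[i]) + 1 > len(best[j]):
                best[j] = best[i] + [j]
    return max(best.values(), key=len)


# Toy example (made-up values): both rows rank the columns in the same order.
genes = [
    [1, 5, 3, 9],
    [2, 8, 4, 10],
]
print(longest_shared_trend(genes))  # [0, 2, 1, 3]
```

Adding a row that disagrees on some columns shortens the path, which is how a shared trend restricted to a subset of columns (a bicluster) shows up in this simplified picture.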
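
Assuming the F1 score mentioned in the abstract is the usual harmonic mean of the bicluster enrichment score and the gene enrichment score, it can be computed as in the sketch below. The function name `f1_score` and the example values are hypothetical and are not results from the dissertation.

```python
def f1_score(bicluster_enrichment, gene_enrichment):
    """Harmonic mean of two scores in [0, 1]; returns 0.0 when both are zero."""
    total = bicluster_enrichment + gene_enrichment
    if total == 0:
        return 0.0
    return 2 * bicluster_enrichment * gene_enrichment / total


# Made-up scores: the harmonic mean is pulled toward the weaker of the two.
print(round(f1_score(0.8, 0.4), 4))  # 0.5333
```

The harmonic mean rewards algorithms only when both enrichment scores are high, so a method cannot score well by excelling on one measure alone.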
Keywords/Search Tags: Gene expression analysis, Bicluster, Combinatorial algorithms, Bioinformatics