Key Technology Research On Gene Expression Data Mining

Posted on: 2017-10-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Jia
Full Text: PDF
GTID: 1310330566455711
Subject: Computer Science and Technology
Abstract/Summary:
As an unprecedented breakthrough in experimental molecular biology, the DNA microarray enables simultaneous monitoring of the expression levels of thousands of genes over many experimental conditions. Studies have shown that analyzing microarray data is essential for finding gene coexpression networks, designing new types of drugs, preventing disease, and so on. With the advance of microarray and analysis techniques, large volumes of gene expression data and Order-Preserving Submatrix (OPSM) mining results are being produced, and biologists cannot use these datasets easily and effectively. It is therefore urgent to explore and design novel methods and techniques for analyzing these rich data resources.

Recently, researchers have proposed many methods for mining OPSMs from gene expression datasets in bulk, and these methods perform well. However, in a parallel and distributed environment with massive and noisy datasets, existing methods leave the following problems open:
(1) How to reduce the communication cost, use the computing resources effectively, and guarantee the accuracy and completeness of the mining results in a parallel and distributed environment.
(2) How to retrieve specific OPSMs directly from an index rather than running batch mining each time.
(3) How to design indexing and query methods that keep the response time short.
(4) How to employ user-defined constraints to improve OPSM query relevancy and obtain fast retrieval responses.

To address these problems, i.e., local pattern mining, indexing, and querying over gene expression datasets, we conduct an in-depth study and propose a series of novel mining, indexing, and query methods and optimization techniques that are applicable to the new situation and requirements. This work was supported in part by the National Basic Research Program 973 of China, the Natural Science Foundation of China, and the Graduate Starting Seed Fund of Northwestern Polytechnical University.

The main contributions of the thesis can be summarized as follows:
(1) Parallel partitioning and mining of gene expression data with a butterfly network. We point out that this workload is hard to process in parallel on existing parallel and distributed systems. To mine OPSMs quickly from gene expression data, we give a Butterfly Network based parallel partitioning and mining method. It extends the Hama BSP framework so that each node exchanges data with one specific node in each superstep, and the maximum number of supersteps is log2 N. The experimental results show that the new method addresses the weak points of the BSP framework of Apache Hama, reduces the amount of data transferred, and speeds up processing. Further, we prove in theory that the proposed method guarantees complete mining results.
(2) Keyword based OPSM indexing and query method. Quick retrieval of OPSMs from massive gene expression data plays a key role in helping biologists find physiological function modules; at present, however, it is done with batch mining techniques. To retrieve OPSMs directly from an index instead of by batch mining, we put forward a prefix-tree based indexing method with row and column header tables, row/column keyword based exact and fuzzy query techniques, and multi-type OPSM query methods. Extensive experiments show that the proposed method is effective and efficient.
(3) OMEGA: an Order-Preserving Submatrix mining, indexing, and search tool. We design and implement an OPSM mining, indexing, and search tool, named OMEGA, using the Butterfly Network and the prefix-tree based indexing method with column and row header tables.
(4) Constrained query of Order-Preserving Submatrices in gene expression data. To improve the query relevancy of OPSMs, we introduce two query methods, one based on enumerating subsequences (esIndex) and one based on multi-dimensional indices (cIndex). Both use user-defined constraints to search for relevant results. Extensive experiments on real datasets demonstrate that the two constrained query methods based on cIndex and esIndex perform better than the brute-force method. To further reduce the size of the indices, we give a constrained query method based on Signature and Trie, and the experiments show that this query method is effective and efficient.
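
The exchange pattern described in contribution (1) can be illustrated with a minimal sketch. It assumes the standard butterfly pairing rule, in which node i exchanges with node i XOR 2^s in superstep s, and it assumes that the number of nodes N is a power of two. The abstract states only that each node exchanges data with one specific node per superstep and that at most log2 N supersteps are needed, so the pairing rule, the function names, and the set-union "merge" below are illustrative assumptions and not the thesis's actual Hama implementation.

```python
def butterfly_partner(node: int, step: int) -> int:
    """Partner of `node` in superstep `step` (assumed rule: flip bit `step`)."""
    return node ^ (1 << step)


def butterfly_all_gather(local_data):
    """Simulate a butterfly exchange over N = len(local_data) nodes.

    local_data[i] is the collection of items initially held by node i.
    Returns (held, steps), where held[i] is the set of items node i holds
    after all supersteps and steps is the number of supersteps executed.
    """
    n = len(local_data)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of nodes")
    steps = n.bit_length() - 1  # equals log2(n) for a power of two
    held = [set(items) for items in local_data]
    for step in range(steps):
        # In BSP terms, all sends within a superstep see the state from the
        # start of that superstep, so exchange against a snapshot.
        snapshot = [set(h) for h in held]
        for node in range(n):
            held[node] |= snapshot[butterfly_partner(node, step)]
    return held, steps
```

With eight simulated nodes that each start with one item, three supersteps suffice for every node to hold all eight items, which matches the log2 N bound quoted above. Each node sends to exactly one partner per superstep, which is the property the abstract credits for the reduced data transfer.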
Keywords/Search Tags: Gene expression data, Order-preserving submatrix, Parallel mining, Butterfly network, Indexing and query