Font Size: a A A

Study On Some Data Mining Methods For Gene Expression Data

Posted on:2008-03-01Degree:MasterType:Thesis
Country:ChinaCandidate:X L TangFull Text:PDF
GTID:2178360215974791Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Microarrays are one of the latest breakthroughs in experimental molecular biology, which allow monitoring of gene expression for tens of thousands of genes in parallel. The gene expression data are organized as matrices. These matrices have to be analyzed further, if any knowledge about the underlying biological processes is to be extracted. The gene expression data mining is a hot and difficult topic in Bioinformatics.In the gene expression data analysis, biclustering and mining frequent patterns are important operations. By biclustering, genes are grouped into different clusters. The genes in the same cluster have the same feature. According to the known function genes, function of other genes in the same cluster is concluded. By the analysis of association rules in gene expression data, the strong or poor relations of genes in several samples are found. The relations among them can be gotten by mining their frequent patterns.As gene expression data analysis consumes huge amount of computation time and memory space, high performance computer is required. In this paper, we research on the following issues deeply and propose some relative resolving algorithms. These methods are proved concise and efficient.First, the algorithm ACA_biclustering is presented to solve the determinate biclustering problem. The ant colony algorithm (ACA) is applied to the biclustering problem. Given the parameter K, K biclusters which meet the threshold value are attained. The N rows and M columns in the matrix are coded as a string, the length of which is N+M. The string denotes the bicluster which is embedded in the expression data. The row or column which is in the bicluster is coded as 1, or else 0. Every character of the string corresponds to a node. The more intensity of the pheromone on the node is, the more probability to be selected it has. First, K initial strings are listed. According to the standard whether the bicluster quality can be improved by adding or removing a node, we choose the node from the K nodes in the first column, until the last column N+M. So every ant's solution corresponds to the improved K biclusters. Comparing the results of all ants, we keep the best result. Then the selection iteration continues until the result is no longer improved. Compared with the similar algorithm, our algorithm can obtain more accurate result and has higher computational speed and efficiency.Second, we propose a merged determinate biclustering algorithm (MDBC) to solve the undeterminate biclustering problem. In the gene expression data, the number of rows is far more than that of columns, we only bicluster the every two columns. The prune strategy is used to remove useless biclusters. Then an index tree is constructed. According to the anti-monotonicity of this kind of biclusters, based on the merged method, we get the biclusters which meet the threshold value from the small biclusters that include the least columns. To the increment data, it is no need to biclustering all the data. The merged method is still used. Through the way, only the increment data are biclustered, then the existing biclusters are combined, thus all the biclusters in the incremental data can be obtained faster than before.Furthermore, we deeply study the problem of mining frequent closed patterns in the gene expression data and present the algoritm---efficient mining frequent closed patterns from high dimensional data (EMHCP). The transaction set notion is applied to the gene expression data, and the different experiments correspond to transaction set. Aiming at the high dimesion in the expression data, a bit map table is constructed to get the itemsets existing in the two transactions fast. With the itemsets, a compound tree of rowset and itemset is constructed. While constructing it, efficient pruning strategy is used to reduce the search space and accelerate the mining speed. Then by searching the tree in the depth first order, all frequent closed patterns can be mined.
Keywords/Search Tags:gene expression data, data mining, biclustering, association rule, frequent pattern, ant colony algorithm
PDF Full Text Request
Related items