A top-down approach for mining most specific frequent patterns in biological sequence data

Posted on: 2005-04-13
Degree: M.Sc
Type: Thesis
University: Simon Fraser University (Canada)
Candidate: Zhang, Xiang
Full Text: PDF
GTID: 2458390008483156
Subject: Computer Science
Abstract/Summary:
The emergence of automated high-throughput sequencing technologies has resulted in a huge increase in the number of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences, each of which subsumes a large number of more general patterns. In the biological domain, a wealth of knowledge on the relationships between the symbols of the underlying alphabets (in particular, amino acids) of the sequences has been acquired, which can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered, and these are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.
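The two properties named in the abstract (specific patterns before general ones, and only minimal generalization of infrequent candidates) can be illustrated with a small sketch. This is not the ToMMS algorithm from the thesis, whose details the abstract does not give. It is a toy breadth-first search under assumptions made here: the concept graph is a small hypothetical tree (`PARENT`) in which all leaves have equal depth, patterns are contiguous and of fixed length `k`, and every function name is invented for illustration.

```python
from collections import deque

# Hypothetical toy concept hierarchy (child -> parent). A real concept graph
# for amino acids would be larger and need not be a tree.
PARENT = {
    "A": "small", "G": "small",
    "D": "charged", "K": "charged",
    "small": "any", "charged": "any",
}

def ancestors_or_self(symbol):
    """Return the symbol followed by all of its ancestors."""
    chain = [symbol]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def covers(general, specific):
    """True if each symbol of `general` equals or is an ancestor of `specific`'s."""
    return all(g in ancestors_or_self(s) for g, s in zip(general, specific))

def support(pattern, sequences):
    """Count the sequences that contain at least one window matched by `pattern`."""
    k = len(pattern)
    return sum(
        any(covers(pattern, seq[i:i + k]) for i in range(len(seq) - k + 1))
        for seq in sequences
    )

def minimal_generalizations(pattern):
    """Yield patterns obtained by lifting exactly one symbol to its direct parent."""
    for i, sym in enumerate(pattern):
        if sym in PARENT:
            yield pattern[:i] + (PARENT[sym],) + pattern[i + 1:]

def most_specific_frequent(sequences, k, min_support):
    """Top-down search: start from concrete windows, generalize only on failure."""
    start = sorted({tuple(seq[i:i + k])
                    for seq in sequences
                    for i in range(len(seq) - k + 1)})
    queue = deque(start)          # breadth-first: fewer generalization steps first
    seen = set(start)
    result = []
    while queue:
        cand = queue.popleft()
        # Skip candidates that merely generalize an already-found frequent pattern.
        if any(covers(cand, found) for found in result):
            continue
        if support(cand, sequences) >= min_support:
            result.append(cand)   # frequent: keep it, do not generalize further
        else:
            for gen in minimal_generalizations(cand):
                if gen not in seen:
                    seen.add(gen)
                    queue.append(gen)
    return result

if __name__ == "__main__":
    toy_sequences = ["AD", "GK", "AK"]
    print(most_specific_frequent(toy_sequences, k=2, min_support=3))
```

On the toy input, no concrete two-letter window occurs in all three sequences, so the search lifts symbols one step at a time until it reaches the pattern ("small", "charged"), which matches every sequence. More general patterns such as ("any", "charged") are never reported because they only generalize a pattern that has already been found.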
Keywords/Search Tags:Mining, Patterns, Specific frequent, Data, Biological, Sequence