Font Size: a A A

Prediction/classification techniques for sequences and graphs

Posted on:2005-09-22Degree:Ph.DType:Thesis
University:University of MinnesotaCandidate:Deshpande, MukundFull Text:PDF
GTID:2458390008994682Subject:Computer Science
Abstract/Summary:
Recent years have seen a tremendous growth in the availability of data from scientific projects like the Human Genome Project (http://www.ornl.gov/hgmis/) that plans to sequence the entire human genome, or the Protein Data Bank (http://www.pdb.org) that provides a single repository for 3-D biological macromolecular structure data. With this large scale availability of data, researchers are now shifting focus from generating and assembling the data to analyzing and understanding this data. One of the challenges in analyzing such scientific datasets is that most of them tend to have strong sequential, relational, and/or spatial characteristics. Therefore representing such datasets in an attribute value format (required by most data mining algorithms) is not only tedious, but also non-intuitive.; This thesis approaches the problem of mining scientific datasets by using representative data-types to encode the information in scientific datasets, and developing data mining algorithms for such data-types. Such data mining algorithms can then be used to solve problems in diverse scientific domains. Towards this goal the thesis uses two data-types: sequences and labelled graphs. enough to capture the important dataset characteristics for most scientific applications. The thesis develops three data mining algorithms for mining these data-types: (i) sequence prediction, (ii) sequence classification, and (iii) graph classification.; A new technique called Selective Markov Model (SMM) is presented for the sequence prediction problem. This technique achieves an extremely compact representation without compromising the predictive accuracy of the model.; For the sequence classification problem this thesis presents a feature representation for sequences and proves that the traditional Markov chain based classifier is a linear discriminant operating in this feature space. This realization allows the use of powerful classification techniques in conjunction with Markov Chains.; Finally, the thesis presents a graph classification approach that uses novel subgraph based features coupled with effective feature selection. The subgraph based features capture both the connectivity as well as the position information of the vertices in the graph.
Keywords/Search Tags:Data, Graph, Sequence, Classification, Scientific
Related items