Learning with multiple kernels: Semidefinite programming, duality, efficient optimization and applications in computational biology

Posted on: 2006-01-23
Degree: Ph.D.
Type: Thesis
University: University of California, Berkeley
Candidate: Lanckriet, Gert Rene Georges
Full Text: PDF
GTID: 2450390008955668
Subject: Engineering
Abstract/Summary:
An important challenge for the field of machine learning is to leverage the diversity of information available in large-scale learning problems, in which different sources of information often capture different aspects of the data. Beyond classical vectorial data formats, information in the form of graphs, trees, strings and other structures has become widely available. For example, in computational biology many such sources of information about genes and proteins are now available: sequence, expression, protein and regulation information. More data types will become available in the near future, such as array-based fitness profiles and protein-protein interaction data from mass spectrometry.

Recent work in computational biology (such as gene function prediction, prediction of protein structure and localization, and inference of regulatory and metabolic networks) could benefit significantly from an approach that treats the different types of information in a unified way, merging them into a single representation, rather than using only the description judged most relevant to the problem at hand.

In this thesis, a principled computational and statistical framework is introduced to integrate data from heterogeneous information sources in a flexible and unified way. The approach is formulated within the unifying learning framework of kernel methods and applied to the specific case of classification. Each data set is represented via a kernel function, which defines a generalized similarity relationship between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and provides a principled framework in which many types of data can be represented, including vectors, strings, trees and graphs.

The resulting formulation takes the form of a semidefinite programming (SDP) problem. Although this implies a polynomial-time algorithm, the scale of many real-life problems is often beyond the reach of general-purpose SDP algorithms. Using tools from conic duality and convex analysis, a dedicated algorithm is derived that is significantly more efficient than generic SDP methods in this setting.

Finally, applications to computational biology are presented, showing that classification performance can be enhanced by integrating diverse genome-wide information sources.
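
To make the kernel representation concrete, the following minimal Python sketch builds one kernel matrix per data source and merges them into a single matrix. It rests on assumptions not stated in the abstract: the kernels are combined as a nonnegative weighted sum, the weights are fixed by hand rather than learned through the SDP, and the toy data, the choice of a linear kernel and a 2-mer spectrum kernel, and all function names are hypothetical illustrations.

```python
from collections import Counter


def linear_kernel(x, y):
    """Inner product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(x, y))


def spectrum_kernel(s, t, k=2):
    """Count-based similarity of two strings via their shared k-mers."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[m] * ct[m] for m in cs)


def gram_matrix(items, kernel):
    """Pairwise kernel values for one data source."""
    return [[kernel(a, b) for b in items] for a in items]


def combine(grams, weights):
    """Weighted sum of kernel matrices (weights assumed nonnegative)."""
    n = len(grams[0])
    return [
        [sum(w * g[i][j] for w, g in zip(weights, grams)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: three genes, each described by two sources.
expression = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # vector data
sequences = ["ACGT", "ACGA", "TTGT"]               # string data

K_expr = gram_matrix(expression, linear_kernel)
K_seq = gram_matrix(sequences, spectrum_kernel)

# Fixed, hand-picked weights; the thesis instead optimizes the combination.
K = combine([K_expr, K_seq], weights=[0.5, 0.5])
```

The combined matrix `K` can then be passed to any kernel-based classifier that accepts a precomputed kernel. A nonnegative weighted sum of valid kernels is itself a valid kernel, which is why each source can be handled separately before merging.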
Keywords/Search Tags: Computational biology, Information, Efficient, Kernel, Available, Sources