Maximum entropy modeling for distributed classification, regression and interaction discovery

Posted on: 2010-09-30
Degree: Ph.D
Type: Dissertation
University: The Pennsylvania State University
Candidate: Zhang, Yanxin
Full Text: PDF
GTID: 1448390002476679
Subject: Engineering
Abstract/Summary:
The maximum entropy (ME) principle has been widely applied to specialized applications in statistical learning and pattern recognition. The idea of the ME method is to find a probability distribution that satisfies whatever information is available from the known data, expressed in the form of constraints. The ME solution is the unique Gibbs distribution that maximizes the likelihood of the training data. In this dissertation, we develop ME methods with applications to three important tasks: distributed classification, regression, and identification of feature interactions.

In distributed classification paradigms, where common labeled data may not be available for designing a classifier ensemble, traditional fixed decision-aggregation rules such as voting, averaging, or naive Bayes cannot account for class prior mismatch or classifier dependencies. Previous transductive learning strategies have several drawbacks: feasibility of the constraints was not guaranteed, and heuristic learning procedures were applied. We overcome these problems by proposing a transductive maximum entropy (TME) model that designs the aggregation to satisfy the constraints imposed by the local classifiers. We augment the test-set support to ensure the feasibility of the constraints and develop a transductive iterative scaling (TIS) algorithm to find the optimal solution. This method is shown to achieve improved decision accuracy over earlier transductive approaches and fixed rules on a number of UC Irvine data sets.

Typically, ME models have been developed for classification on discrete feature spaces, i.e., both the output variable and the input features are categorical or ordinal. We extend the ME model to the regression problem, where the output variable and input features are mixed continuous-discrete valued. We propose a hierarchical maximum entropy (HME) model for regression that builds a posterior model for the output variable, encoding constraints on hierarchically derived features obtained by agglomerative clustering of both the input features and the output variable. We develop a greedy order-growing constraint search method that sequentially adds constraints of flexible order to the HME model based on likelihood gain on a validation set. Experiments show that the HME model for regression performs comparably to or better than other regression models, including generalized linear regression, the multi-layer perceptron, support vector regression, and regression trees.

Individual variation in risk for complex disorders results from the joint effects of both environmental and genetic factors. There are statistical, computational, and methodological challenges associated with the discovery of gene-gene and gene-environment phenotypic interactions. We propose maximum entropy conditional probability modeling (MECPM), coupled with a novel model structure search in which the model is made explicit by, and determined by, the interactions that confer phenotype-predictive power. Model structure and order selection are based on the Bayesian Information Criterion (BIC), which accounts for the finite sample size in fairly comparing interactions of different orders and in determining the number of interactions. We develop a fast approximate search algorithm using cross entropy, achieving improved sensitivity and specificity in recovering ground-truth markers and interactions when tested on real genotyped data with up to 1000 SNPs and 20 or fewer predisposing variants, including interactions up to fifth order.
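For concreteness, the Gibbs form that the ME solution takes is the standard textbook result (stated here in generic notation rather than the dissertation's own):

    p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
    \qquad
    Z(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)

where the f_i are the feature (constraint) functions and the lambda_i are the Lagrange multipliers fit by maximum likelihood.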
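The TIS algorithm itself is not reproduced here; as a point of reference, the following is a minimal sketch of classical generalized iterative scaling (GIS), the family of fixed-point updates that TIS adapts to the transductive setting. The feature matrix, constraint targets, and iteration count are illustrative assumptions, not values from the dissertation.

    # Minimal generalized iterative scaling (GIS) sketch for a maximum
    # entropy model over a finite support. Illustrative only; the
    # dissertation's TIS variant additionally augments the test-set
    # support to keep the constraints feasible.
    import numpy as np

    def gis(F, target, n_iters=200):
        # F: (n_states, n_features) binary feature matrix.
        # target: desired expected value of each feature (the constraints).
        C = F.sum(axis=1).max()        # GIS needs a constant feature sum,
        slack = C - F.sum(axis=1)      # so append a slack feature.
        Fp = np.column_stack([F, slack])
        tp = np.append(target, C - target.sum())
        lam = np.zeros(Fp.shape[1])
        for _ in range(n_iters):
            p = np.exp(Fp @ lam)
            p /= p.sum()                        # current model distribution
            model_exp = Fp.T @ p                # model feature expectations
            lam += np.log(tp / model_exp) / C   # GIS fixed-point update
        p = np.exp(Fp @ lam)
        return p / p.sum(), lam

    # Example: two binary features over four joint states. The fitted
    # distribution's feature means converge to the targets (0.3, 0.6).
    F = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    p, lam = gis(F, np.array([0.3, 0.6]))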
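The greedy order-growing constraint search can be summarized as a generic forward-selection loop. The sketch below is a schematic under stated assumptions, not the dissertation's code: fit, val_loglik, and min_gain are hypothetical placeholders for the model-fitting routine, the validation log-likelihood, and a stopping threshold.

    # Greedy order-growing constraint search (schematic). At each step,
    # refit with each remaining candidate constraint and keep the one
    # with the largest validation log-likelihood; stop when no candidate
    # improves the current model by at least min_gain.
    def greedy_constraint_search(candidates, fit, val_loglik, min_gain=1e-3):
        selected = []
        model = fit(selected)
        best = val_loglik(model)
        remaining = list(candidates)
        while remaining:
            scored = [(val_loglik(fit(selected + [c])), c) for c in remaining]
            score, choice = max(scored, key=lambda t: t[0])
            if score - best < min_gain:
                break
            selected.append(choice)
            remaining.remove(choice)
            best, model = score, fit(selected)
        return selected, model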
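The BIC trade-off that governs interaction order selection is simple to state: a higher-order interaction must buy enough log-likelihood to pay for the extra parameters it introduces. A minimal worked illustration follows; the sample size, parameter counts, and log-likelihoods are invented for the example.

    # BIC = -2 * log-likelihood + k * log(n); lower is better. A candidate
    # higher-order interaction is accepted only if its likelihood gain
    # outweighs its larger parameter penalty.
    import math

    def bic(loglik, k, n):
        return -2.0 * loglik + k * math.log(n)

    n = 2000                        # hypothetical sample size
    pairwise = bic(-1450.0, 3, n)   # 2nd-order term, 3 extra parameters
    triplet = bic(-1442.0, 7, n)    # 3rd-order term, 7 extra parameters
    # Here the pairwise term wins (BIC ~2922.8 vs ~2937.2) even though the
    # triplet fits the data better, because its penalty is smaller.
    print(min((pairwise, "pairwise"), (triplet, "triplet")))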
Keywords/Search Tags: Maximum entropy, Regression, Model, Distributed classification, Interactions, Data, Output variable