
Newton Methods for Large Scale Problems in Machine Learning

Posted on: 2015-07-24
Degree: Ph.D.
Type: Thesis
University: Northwestern University
Candidate: Hansen, Samantha Leigh
Full Text: PDF
GTID: 2458390005481397
Subject: Applied Mathematics
Abstract/Summary:
The focus of this thesis is on practical ways of designing optimization algorithms for minimizing large-scale nonlinear functions, with applications in machine learning. Chapter 1 introduces the overarching ideas of the thesis. Chapters 2 and 3 are geared toward supervised machine learning applications that involve minimizing a sum of loss functions over a dataset, but they apply, respectively, to the more general settings of minimizing a stochastic convex function and of minimizing a convex function with an L1 regularizer. Chapter 4 discusses efficient implementations of projected Newton methods for nonnegative tensor factorization.

Chapter 2 outlines a new stochastic quasi-Newton algorithm that incorporates second-order information through the L-BFGS approximation of the Hessian. The method's novel element is the use of subsampled Hessian-vector products and iterate averaging to define the L-BFGS curvature pairs. Numerical results on speech and text classification problems demonstrate the effectiveness of the new algorithm over stochastic gradient descent.

Chapter 3 presents a new active-set method for minimizing a convex function with an L1 regularizer. The algorithm follows a two-phase approach: an active-set prediction phase that employs first-order and second-order information, and a subspace phase that performs a Newton-like step using subsampled Newton-CG. The novelty of the algorithm lies in using an iterative shrinkage step in the active-set phase and a projected piecewise-linear line search in the subspace phase. The new algorithm is compared against a state-of-the-art orthant-wise limited-memory algorithm on a speech classification problem.

The fourth chapter concerns nonnegative tensor/matrix factorization with a Kullback-Leibler objective. All presented algorithms start from an alternating block Gauss-Seidel framework and formulate each block subproblem as a sum of independent row functions, each depending on only a subset of the variables. The block subproblem is minimized by independently minimizing each row function, which is strictly convex with nonnegativity constraints. The conclusion is that applying two-metric gradient-projection techniques with exact or approximate Hessian information to each of the independent row functions is much more effective than applying the same algorithms directly to the block subproblem.
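To make the Chapter 2 description concrete, the following is a minimal sketch of a stochastic quasi-Newton loop of this flavor, not the thesis's actual implementation: curvature pairs are formed from differences of averaged iterates and subsampled Hessian-vector products, and search directions come from the standard L-BFGS two-loop recursion. The helper functions grad_batch and hess_vec_sub, and the parameter values L, M, and alpha, are hypothetical placeholders chosen here for illustration.

import numpy as np

def two_loop(grad, s_list, y_list):
    """Standard L-BFGS two-loop recursion: approximate H^{-1} @ grad."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        q -= a * y
        alphas.append((a, rho, s, y))
    if s_list:
        s, y = s_list[-1], y_list[-1]
        q *= np.dot(s, y) / np.dot(y, y)       # initial Hessian scaling
    for a, rho, s, y in reversed(alphas):
        b = rho * np.dot(y, q)
        q += (a - b) * s
    return q

def sqn(w, grad_batch, hess_vec_sub, T=2000, L=20, M=10, alpha=0.05):
    """Sketch of a stochastic quasi-Newton iteration in which subsampled
    Hessian-vector products define the L-BFGS curvature pairs."""
    s_list, y_list = [], []
    w_avg, w_avg_prev = np.zeros_like(w), None
    for t in range(1, T + 1):
        g = grad_batch(w, t)                    # stochastic gradient on a small batch
        d = two_loop(g, s_list, y_list) if s_list else g
        w = w - alpha * d                       # fixed step for simplicity; could decay
        w_avg += w / L                          # running average of the last L iterates
        if t % L == 0:                          # refresh curvature pairs every L steps
            if w_avg_prev is not None:
                s = w_avg - w_avg_prev
                y = hess_vec_sub(w_avg, s, t)   # subsampled Hessian-vector product
                if np.dot(s, y) > 1e-10:        # keep only safe curvature pairs
                    s_list.append(s); y_list.append(y)
                    if len(s_list) > M:
                        s_list.pop(0); y_list.pop(0)
            w_avg_prev, w_avg = w_avg, np.zeros_like(w)
    return w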
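For Chapter 3, the iterative shrinkage step used to predict the active set can be illustrated with the standard soft-thresholding (ISTA) operator. This is only a sketch of the idea: the step length alpha and the function names are chosen here for illustration and are not taken from the thesis.

import numpy as np

def shrink(w, g, lam, alpha):
    """One iterative-shrinkage step on F(w) = f(w) + lam * ||w||_1:
    a gradient step on f followed by soft-thresholding."""
    z = w - alpha * g
    return np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)

def predict_active_set(w, g, lam, alpha=1.0):
    """Variables driven to (or kept at) zero by the shrinkage step are predicted
    active; the remaining free variables define the subspace on which a
    Newton-like step would then be taken."""
    w_plus = shrink(w, g, lam, alpha)
    active = (w_plus == 0.0)
    free = ~active
    return active, free, w_plus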
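The row-wise decomposition described in the fourth chapter can be sketched for the matrix case V ≈ WH with a Kullback-Leibler objective: with H fixed, the block subproblem for W separates into independent, strictly convex row problems with nonnegativity constraints. The sketch below solves each row with SciPy's bound-constrained L-BFGS-B as a stand-in for the two-metric gradient-projection methods studied in the thesis; the eps smoothing term and the solver choice are assumptions made here.

import numpy as np
from scipy.optimize import minimize

def kl_row_obj(w_row, v_row, H, eps=1e-12):
    """KL objective and gradient for one row of W with H fixed:
    sum_j [ (w_row @ H)_j - v_row_j * log((w_row @ H)_j) ], up to constants."""
    pred = w_row @ H + eps
    f = np.sum(pred - v_row * np.log(pred))
    grad = H @ (1.0 - v_row / pred)
    return f, grad

def update_W(V, W, H):
    """One Gauss-Seidel block update of W: the block subproblem splits into
    independent row problems, each solved under nonnegativity constraints."""
    W_new = np.empty_like(W)
    for i in range(V.shape[0]):
        res = minimize(kl_row_obj, W[i], args=(V[i], H), jac=True,
                       method="L-BFGS-B", bounds=[(0.0, None)] * W.shape[1])
        W_new[i] = res.x
    return W_new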
Keywords/Search Tags: Algorithm, Function, Problem, New, Machine, Minimizing, Block