Constrained Optimization Overview

This tutorial provides an overview of constrained optimization problems, and how this relates to Deep Learning. We will cover problem formulation, ….

Formulation

In humancompatible-train, and in Constrained Machine Learning more generally, we are interested in solving problems of the form:

\[\begin{split}\min_{x\in\mathbb{R}^n} \quad & \mathbb{E}[f(x,\xi)] \\ \text{s.t.} \quad & \mathbb{E}[g(x,\xi)] \leq 0, \\ & \mathbb{E}[h(x,\xi)] = 0, \\\end{split}\]

where \(f\) is the objective function we want to minimize, \(g\) are the inequality constraints, and \(h\) are the equality constraints. The expectation is taken over some random variable \(\xi\), which represents the data.

You may recognize the first line of the above formula as the standard formulation of a machine learning problem, where we want to minimize the expected loss over the data. We then introduce constraints – they could express anything from some bounds on the weights of the model, or a requirement on the model’s predictions to satisfy some fairness criterion, to the boundary conditions of a physical system.

Note

As is standard in the field, we adopt the convention of writing the constraints as \(g(x) \leq 0\), and \(h(x) = 0\). This is just a notational choice, and does not affect the generality of the formulation. It is trivial to transform \(g(x) \geq 0\) into \(-g(x) \leq 0\), or \(h(x) = \epsilon\) into \(g(x) - \epsilon = 0\) for some \(\epsilon\). We refer to this \(\epsilon\) as the constraint bound, or threshold.
It is also easy to switch between equality and inequality constraints: to get \(g(x) = 0\), one can set \(-g(x) \leq 0\) and \(g(x) \leq 0\) simultaneously. In fact, different algorithms are designed to handle either equality or inequality constraints natively, but it is trivial to switch between the two. We shall see more concrete examples later on.

Solving Constrained Problems

We know that to solve an unconstrained optimization problem, we can use gradient descent, or any of its myriad variants. But how do we solve a constrained optimization problem?

The Constrained Machine Learning field seems to have converged on Lagrangian-based methods, which utilize the Lagrangian function to transform the constrained problem into an unconstrained one.

Going forward in this tutorial, we will focus on the deterministic case to simplify notation; the stochastic case is more complex, but utilizes the same principles (think full-batch vs. mini-batch Gradient Descent). For more rigorous mathematical treatment of the stochastic case, see TODO, as well as the references included in the documentation for each of the algorithms.

So, we have the following constrained problem:

\[\begin{split}\min_{x\in\mathbb{R}^n} \quad & f(x,\xi) \\ \text{s.t.} \quad & g(x,\xi) \leq 0, \\ & h(x,\xi) = 0, \\\end{split}\]

In a deterministic case, the Lagrangian function is defined as follows:

\[\mathcal{L}(x, \lambda, \mu) = f(x) + \lambda^T g(x) + \mu^T h(x)\]

where \(\lambda\) is the Lagrange multiplier associated with the constraint \(g(x) \leq 0\), and \(\mu\) is the Lagrange multiplier associated with the constraint \(h(x) = 0\).

It is then possible to show that the original constrained problem is equivalent to the following unconstrained problem:

\[\min_{x\in\mathbb{R}^n} \max_{\lambda \geq 0, \mu} \mathcal{L}(x, \lambda, \mu)\]

We refer to the original problem as the primal problem, with \(x\) as the primal variables, and to the transformed problem as the dual problem, with \(\lambda\) and \(\mu\) as the dual variables. The dual problem is unconstrained, and can be solved using a clever application of standard optimization techniques.

In particular, we can use alternating updates: fix the primal variables, and optimize the dual variables using gradient ascent; then fix the dual variables, and optimize the primal variables using gradient descent. This process is repeated until convergence.

In humancompatible-train, we implement several variants of this approach, based on methods present in the literature. For more details, see the corresponding documentation; for now, it is important to understand that they are all based on the same principle of alternating updates to the primal and dual variables.

In the simplest case of the Lagrangian method, this gives us the following update rules:

\[\begin{split}\lambda_{t+1} & = \lambda_t + \beta \nabla_\lambda \mathcal{L}(x_{t}, \lambda_t, \mu_t) = \lambda_t + \beta g(x_{t}) \\ \mu_{t+1} & = \mu_t + \gamma \nabla_\mu \mathcal{L}(x_{t}, \lambda_t, \mu_t) = \mu_t + \gamma h(x_{t}) \\ x_{t+1} & = x_t - \alpha \nabla_x \mathcal{L}(x_t, \lambda_{t+1}, \mu_{t+1})\end{split}\]

where \(\alpha\), \(\beta\), and \(\gamma\) are the learning rates for the primal and dual variables, respectively.

Note

The above update rules are for the simplest variant of the Lagrangian method. The methods implemented in this package are all more complex. Even beyond our implementation, one can (and sometimes should!) modify the update rules by e.g. tweaking the training loop code, as we show in the Tips and Tricks tutorial.
The Lagrangian approach is by far not the only way to do constrained optimization. It is, however, the best fit for a large-scale deep learning setting thanks to its “alternating updates” interpretation, which allows one to use well-established first-order iterative algorithms. We are still looking to implement other families of methods in humancompatible-train, such as SQP-based solvers!

In our package, the dual optimizers handle the updates to the dual variables, while the primal updates are handled by the standard PyTorch optimizers. This allows for seamless integration of constraints into the training loop, as we will see in the next tutorial.