DRMacIver's Notebook

Notes on Lagrangian Duality

Published: 2018-11-07

A long standing failure of mine is that I’ve never really understand Lagrangian duality, and it’s come up enough times that this is starting to get embarrassing. The big problem I’ve always had is that I could follow the logic but I didn’t really get why this was at all a sensible thing to do, so it never really stuck in my head.

Today was dedicated study time for me, and the subject came up again, so I decided to have another crack at it. I think I’ve finally understood what’s going on, and that the problem I was having was that people kept filling their explanations with distracting details. I’m sure someone must have explained it to me this way before, for some reason this time I was able to frame it in exactly the right way for me and it clicked. As such, this may not resonate for you.

The trick for me to was to make the problem more abstract to start with.

Suppose we’ve got some abstract set \(D\) and a function \(g: D \to \mathbb{R} \cup \{\infty\}\). Suppose further we’ve got some \(C \subseteq D\). The constrained minimization problem is that we want to find \(p^* = \inf\limits_{x \in C} g(x)\).

In general this is intractable, but if \(g\) and \(C\) have some special structure we may find it pretty easy. e.g. if \(g\) is a convex function and \(C\) is a convex set, you can just differentiate or do gradient descent or whatever.

One way to make optimisation problems more tractable is the idea of relaxations, where you replace the problem with one that is “easier” in the sense that any optimal solution to the harder problem can be turned into a solution to the easier problem that is at least as good, but not vice versa (“at least as good” is important. For example replacing \(g\) with \(g(x) - 1\) is a valid relaxation, though not a very useful one).

The utility of relaxations is two-fold:

they give a lower bound on the optimal solution.
by finding increasingly tight relaxations you may be able to get to the true exact value. e.g. we could add “slop” to the problem that makes it easier by allowing us to get some bits wrong, and then iteratively reduce the amount of slop to zero.

The easiest relaxation is just to minimize \(g\) over all of \(D\) instead of \(C\), throwing away the constraints, but this doesn’t give us a way of tightening up the relaxations easily. Other examples of relaxations include e.g. dropping some constraints, allowing an integer valued variable to take arbitrary reals, requiring the constraints to only be satisfied to within \(\epsilon\).

There is one particularly useful class of relaxations. Let \(\mathcal{F} = \left(\mathbb{R} \cup \{\infty\}\right)^D\) and \(f \in \mathcal{F}\) be such that \(f|_C \leq g_|C\). i.e. \(f\) can be arbitrary outside \(C\) but on \(C\) it cannot ever exceed \(g\). Let \(R_{g, C}\) be the set of such \(f\). Then necessarily \(\inf\limits_{x \in D} f(x) \leq \inf\limits_{x \in C} f(x) \leq \inf\limits_{x \in C} g(x)\) - i.e. such an \(f\) is a valid relaxation, and as such gives us a lower bound on \(p^*\).

Our goal is to choose \(f\) so that it is very large outside of \(C\) and very close to \(g\) inside of \(C\). If we can ensure that \(|f - g|(x) \leq \epsilon\) inside of \(C\) and that \(f\) attains within \(\epsilon\) of its local minimum in \(C\), we can get our approximation to within \(2 \epsilon\),

In fact, we can reduce \(\epsilon\) to zero! Pick \(f(x) = g(x)\) when \(x \in C\) and \(f(x) = \infty\) when \(x \in D \setminus C\). Then we have \(\inf\limits_{x \in D} f(x) = \inf\limits_{x \in C} f(x) = \inf\limits_{x \in C} g(x)\), and so the bound is exact.

Thus we have \(p^* = \max\limits_{f \in R_{g, C}}\inf\limits_{x \in D} f(x)\) - all such \(f\) give us a lower bound on \(p^*\), and there is at least one such \(f\) that achieves that bound.

The problem with this observation is that it is useless on its own - this version isn’t any easier to solve than the previous one!

The thing that makes it easier is that we can now consider a restricted subset of \(R_{g, C}\) where this problem is easier to solve. Unfortunately, this destroys our equality. Suppose we have \(M \subseteq R_{g, C}\), all we can now claim is \(p^* \geq \sup\limits_{f \in M}\inf\limits_{x \in D} f(x)\), as we might have thrown away the values which are sufficiently close to the maximum as to actually attain it.

The Lagrangian gives us such a family of nicer functions. Suppose we can define \(C\) as the solution set to \(\boldsymbol{h}(x) \leq 0\) for some vector valued \(\boldsymbol{h}\). For any \(\boldsymbol{\lambda} \geq 0\) the function \(f_\boldsymbol{\lambda} = g(x) + \boldsymbol{\lambda} \cdot h(x) \in R_{g, C}\), as we know that \(h(x) \leq 0\) and \(\boldsymbol{\lambda} \geq 0\), so we know that on \(C\), \(\boldsymbol{\lambda} \cdot h(x) \leq 0\) and so \(f_\boldsymbol{\lambda} \leq g\).

Suppose we can calculate \(t(\boldsymbol{\lambda}) = \inf\limits_{x \in D} f_{\boldsymbol{\lambda}}(x)\) (literally we suppose that. In general we can’t, but if \(f\) and \(h\) have some nice form then this is now a simple unconstrained optimisation problem). By our above discussion we know that \(p^* \geq \sup\limits_{\boldsymbol{\lambda}} t(\boldsymbol{\lambda})\). This latter quantity is called the dual solution, \(d^*\), and we have proved the primal-dual inequality that \(d^* \leq p^*\).

Why is this a remotely useful thing to do?

Well the main reason is that calculating the dual solution can be much easier for two reasons:

It has much simpler constraints - the only constraint is that \(\boldsymbol{\lambda} \geq 0\), so the domain of our optimisation is a particularly simple convex set.
It is the maximization of a concave function. We defined it as the infimum of a number of linear functions (of \(\boldsymbol{\lambda}\). They may be highly non-linear in \(x\), but that doesn’t matter!), so it is the infimum of a set of concave functions and thus concave.

What this means in particular is that what we have is a convex optimisation problem (the function is concave, but we’re maximizing it so it’s convex optimisation)! Convex optimisation is easy, so calculating the dual is hopefully easy once you’ve got to this point.

Note that this is true regardless of literally anything about the original problem - no structure or continuity is assumed (although good luck calculating \(t\) if you don’t have some structure and continuity).

Things I still do not understand:

When the primal-dual gap is zero (i.e. when the inequality is an equality). I’m aware of some of the theorems about this but I need to study them.
What the actual interpretations of the lagrange multipliers (\(\boldsymbol{\lambda}\)) means. I’ve seen a bunch of things about “shadow prices” and I had some idea what that meant, but I don’t yet understand how that fits in with this interpretation.