This file starts with Exercises, which ends at exercise 10 and is followed by "Notes: Applied Optimization".

Search for "Exercise 9" for the Regression cost function minimization exercises.

Greetings Nikolaj Kuntner

Exercises - Applied Optimization

Exercise 1.1

For the next proof we need the following parallelogram identity. Prove that for any two vectors ${a,b}\subset \mathbb{R}^n$ $|a-b|^{2} = |a|^{2} - 2,a^\top b + |b|^{2}.$

Solution 1.1

$a, b \in {\mathbb C}$ $|| a+d||^2 := (a+d)^\cdot (a+d) = a^\cdot a + (a^\cdot d + a^\cdot d) + b^\cdot d$ (Associativity is proven by the commutativity of sum symbols.) Let $d:=-b$ and consider real $a, b$ $|| a-c||^2 = (a^-c^)\cdot (a-c) = a^\cdot (a-c)-c^\cdot (a-c) = a^\cdot a + 2, a^T\cdot c + c^T\cdot c$

TODO: Check variable names and signes of the above

Note: We can show that for $a, b \in {\mathbb C}$, the middle bracket equals the real part of $a^*\cdot d$ Note: Also important for parallelograms like here is the parallelogram identity, which is a relation purely in terms of norms (not pure scalar projects like here) Note: This, and related parallelogram formulas, can be understood as (or understood as using as) generalized Pythagorean theorems.

Exercise 1.2

For theorem 2.2, prove the direction (b)->(a) and give a geometric visualization.

Solution 1.2

For $x,y,z\in {\mathbb R}^n$, let $Q := | x-y|^2 - | x-z|^2$, which is like an offset quartic in $y$. $Q = | (x-z) + (z-y)|^2 - | x-z|^2 = | z-y|^2 - 2, (x - z)\cdot (y - z)$ So $(x - z)\cdot (y - z) \le 0 \to \left(Q\ge 0 \land (Q\le 0\leftrightarrow Q=0 \leftrightarrow y = z)\right)$

Note that $Q\le 0$ is equivalent to $d(x, y) \le d(x, z)$.

$C\subset {\mathbb R}^n$ ... non-empty, closed, convex
${\bar z}\in C$

Claim: $\forall (y\in C). (x - {\bar z})\cdot (y - {\bar z}) \le 0 \implies d(x, {\bar z}) = {\mathrm{min}}{d(x, y)\mid y\in C}$

Firstly, $d(x, {\bar z})$ is in $D$. But then the (rhs) consequence can be expressed as $\forall (y\in C). d(x, y) \le d(x, {\bar z}) \to y = {\bar z}$

?Which of those still hold in infinite dimensions?
?Which other distances can be we use?

Exercise 2.2

Show (a) $P_C \circ P_C = P_C$, and (b) $P_C$ maps $\mathbb{R}^n$ onto $\overline{C}$.

Solution 2.2

Note that $c\in {\bar C}\to P_C(c)=c$, i.e. ${P_C}{|{\bar C}}=Id_{\bar C}$.

a) As $P_C(x)\in {\bar C}$ for all $x$, $P_C\circ P_C = Id_{\bar C}\circ P_C = P_C$.

I think (!) one can also use the scala-product characterization and consider the 2-ary predicate $z=P(x)$.

b) Since ${\bar C}\subset{\mathbb R}^n$, the projection is onto ${\bar C}$.

Note: There's a few TODOs to make the solution of 2.2 much more formal/exact:

State the map precisely: $P_C:\mathbb{R}^n\to\overline{C}$; clarify that “onto” means $\mathrm{Im}(P_C)=\overline{C}$.
Replace “$\overline C\subset \mathbb R$” with $\overline C\subset \mathbb R^n$; fix “scala-product” → “scalar product”.
Prove surjectivity by two inclusions: (i) $\mathrm{Im}(P_C)\subseteq\overline C$ is immediate; (ii) for any $z\in\overline C$, take $x=z$ to get $P_C(z)=z$, hence $z\in\mathrm{Im}(P_C)$.
Add the 1-line idempotence proof using the characterization: with $z:=P_Cx\in\overline C$, $\langle z-z,,y-z\rangle=0\le 0$ for all $y\in\overline C$ ⇒ $P_Cz=z$.
Optionally include an alternative argument via the Pythagorean inequality $|x-y|^2\ge|x-P_Cx|^2+|y-P_Cx|^2$ to justify both steps cleanly.

Exercise 2.3

Show that the projection map $P_C$ onto the closure $\overline{C}$ of a convex set $C \ne \emptyset$ is continuous.

Solution 2.3

There's a few definitions for a function $f$ being continuous. In the most abstract definition of continuous, we gotta show that for an open set $U_{\bar C}\subset {\bar C}$ in the image of the function, the inverse image $f^{-1}(U_{\bar C})$ is open. There's also "If a sequence $x_i$ has a limit point $x_L := \lim_{i\to\infty} x_i$, then $f(x_i)$ has limit $f(x_L)$." But here we can use the stronger Lipschitz-continuity. "A map $f:\mathbb{R}^n\to\mathbb{R}^m$ is pointwise Lipschitz at $x_0 \in \mathbb{R}^n$ if there exist $ L \ge 0$ and $r>0$ such that for all $x$ with $ | x - x_0 | < r $, we have $| f(x)-f(x_0)| \le L, | x-x_0| $."

Note that this is almost trivially true when the LHS is zero (such as when $x=x_0$), so one only needs to consider the cases where it's possible can divide by $f(x)-f(x_0)$.

Let $x, x_0\in\mathbb{R}^n$, and set $z:=P_C(x)$ and $z_0:=P_C(x_0)$ denote two points in $\overline C$. We'll prove the $1$-Lipschitz estimate $|P_C x - P_C x_0| \le |x - x_0|$. We can actually prove the stronger lemma. $|z-z_0|^2 \ \le\ \langle x-x_0,,z-z_0\rangle \ \le\ |x-x_0|,|z-z_0|$ (Dividing by $|z-z_0|^2$ gives the result.) The RHS side of this is just Cauchy-Schwarz, so we actually only need $\langle z-z_0,,z-z_0\rangle \ \le\ \langle x-x_0,,z-z_0\rangle$ or $\langle (z-z_0) - (x-x_0),,z-z_0\rangle \ \le\ 0$ or $\langle x_0-z_0,,z-z_0\rangle - \langle x-z,,z-z_0 \rangle \ \le\ 0$ or $\langle x_0-z_0,,z-z_0\rangle + \langle x-z,,z_0-z \rangle \ \le\ 0$

Now each terms are $\le\ 0$ independently, using the scalar product characterization for min-distance to the convex set (used $z_0:=P_C(x_0)$ and $y:=z$, resp $z:=P_C(x)$ and $y:=z_0$).

Note: Also here there's a few few TODOs I'd like to add

Announce the target inequality up front: $|P_Cx-P_Cx_0|\le|x-x_0|$ (global 1-Lipschitz ⇒ continuity).
Name the key lemma (firm nonexpansiveness): $|P_Cx-P_Cx_0|^2 \le \langle x-x_0,,P_Cx-P_Cx_0\rangle$.
Show the inner-product step explicitly: expand $\langle x-x_0,,z-z_0\rangle=\langle x-z,,z-z_0\rangle+\langle z-z_0,,z-z_0\rangle+\langle z_0-x_0,,z-z_0\rangle$ and bound the first/last terms by $0$ using the characterization.
Conclude with Cauchy–Schwarz and the divide-by-$|z-z_0|$ case split ($=0$ vs $>0$).
Standardize symbols ($x,x_0,z,z_0$ throughout) and fix spelling (“Lipschitz”, “we have”).

Exercise 2.4

For $v \in \mathbb{R}^n \setminus {0}$ and $\gamma \in \mathbb{R}$, define the hyperplane $H_{v,\gamma} = { y \in \mathbb{R}^n : v^\top y = \gamma }$. Let $M \subset \mathbb{R}^n$ and $x \notin M$. Assume that $H_{v,\gamma}$ separates $x$ from $M$, i.e. $v^\top z \le \gamma$ for all $z \in M$, and $v^\top x > \gamma$. Show that there exists $\gamma' \in \mathbb{R}$ such that $v^\top x > \gamma' > v^\top z$ for all $z \in M$ (so the hyperplane $H_{v,\gamma'}$ strictly separates $x$ from $M$).

Solution 2.4

Let $\sigma := \sup{ v^\top z : z \in M }$. Since $v^\top z \le \gamma$ for all $z \in M$, and $\gamma < v^\top x$, we have $\sigma \le \gamma < v^\top x$, hence $\sigma < v^\top x$. Define $\gamma_t := (1-t)\cdot \sigma + t\cdot v^\top x$ with any fixed $t\in(0,1)$. (E.g. $t=\tfrac{1}{2}$ gives the middle point between them.) $\gamma_t$ is bigger than the supremum $\sigma$ by construction. The hyperplanes $H_{v,\gamma_t}$ are all strictly between them.

The TODO's here:

Define $\sigma:=\sup{v^\top z: z\in M}$ and note $\sigma\le\gamma<v^\top x$ ⇒ $\sigma<v^\top x$.
Choose a parameterized threshold $\gamma_t:=(1-t)\sigma+t,v^\top x$; record the admissible set $t\in(0,1)$ for strict separation, with comments on $t=0,1$ (only non-strict).
Add a one-line verification: for all $z\in M$, $v^\top z\le\sigma < \gamma_t < v^\top x$ implies $H_{v,\gamma_t}$ strictly separates.
Mention edge cases briefly: if $M=\varnothing$ the statement is trivial; $v\neq 0$ is essential; normalization of $v$ is optional (just rescales $\gamma_t$).
Optionally add a geometric sketch note (picture of levels of $v^\top y$) to build intuition.

Exercise 2.5

(a) Suppose for some ${\hat{x}, c} \subset \mathbb{R}^n$ and $\eta \in \mathbb{R}$ we have $c^\top x \le \eta = c^\top \hat{x}$ for all $x \in \mathbb{R}^n$. Show that this implies $c = 0$ and $\eta = 0$.

(b) Likewise, show that $c^\top z \le \gamma$ for all $z \in \mathbb{R}^n$ (and some finite $\gamma \in \mathbb{R}$) implies $c = 0$ and $\gamma \ge 0$.

Solution 2.5

Again, $\eta = c^\top \hat{x}$. (a) Consider $x=\hat x + t\cdot c$ with $t>0$. Towards a contradiction, assume $c\ne 0$. We then have, $\forall x\in\mathbb R^n$, $\eta\ge c^\top x = c^\top \hat x + t|c|^2 = \eta + t|c|^2 > \eta$ $\eta > \eta$ is a contradiction, so $c=0$. And then also $\eta=0$.

(b) Suppose $c^\top z \le \gamma$ for all $z\in\mathbb R^n$ with finite $\gamma$. Take $d_t=t\cdot c$ with $t>0$. Then $c^\top d_t = t|c|^2 \to \infty$ for growing $t$ (and in particular overtaking finite $\gamma$) iff $c\ne 0$. If $\gamma$ was finite, then $c^\top z \le \gamma$ thus implies $c=0$. As $0=c^\top 0 \le \gamma$ comes up to $\gamma$ from below, we have $\gamma\ge 0$.

Finally, the TODO's here:

In (a), streamline the contradiction: assume $c\neq 0$, take $x=\hat x+t c$ with $t>0$; compute $c^\top x=\eta+t|c|^2>\eta$; conclude $c=0$ and then $\eta=0$.
In (b), emphasize the “finite $\gamma$” hypothesis; if $c\neq 0$, pick $z=t c$ to get $c^\top z=t|c|^2\to\infty$, contradicting boundedness.
Add the final sign check: with $c=0$, $c^\top 0=0\le\gamma$ ⇒ $\gamma\ge 0$; note that the minimal such $\gamma$ is $0$.
Clarify quantifiers (“for all $x$”, “for all $z$”) and remove extraneous phrases (“for growing $t$”)—state limits or inequalities precisely.
Optional extension: remark that any two-sided global bound $|c^\top z|\le \Gamma$ also forces $c=0$ by the same scaling argument.

Exercise 3.1 (2.6 on p.9)

Let $C\neq\varnothing$ be closed and convex and let $K$ be a compact convex set disjoint from $C$. Select $(\bar x,\bar z)\in K\times C$ such that the distance achieves the gap, i.e. $d(\bar x,\bar z)=\delta(K,C)$. Define $v:=\bar x-\bar z$ and $\gamma:=v^\top\bar z$.

Prove the following two statements:

(16) $(\bar z-\bar x)^\top(x-\bar x)\le 0\quad\text{for all }x\in K.$

(17) $v^\top x\ \ge\ v^\top\bar x\ >\ \gamma\ =\ v^\top\bar z\quad\text{for all }x\in K.$

Solution 3.1 (2.6 on p.9)

(16) To show is $v^\top(x-\bar x)\ge 0$ $v^\top$ is the vector pointing from one convex body to ther other, from $C$ to $K$. $x-\bar x$ is a vector inside $K$, with $\bar x$ the point closest to $C$. Graphically it's clear that the angle between those is smaller than $\tfrac{\pi}{2}$, i.e. that the angle is positive. Formally, manipulate the defining convexity property of $K$ (The scalar product, Theorem 2.2 or Theorem 2.3).

(17) Subtract $\gamma$ from both inequalities and again identify $v$ where possible. The left inequality follows from (16) and the right just says $v^\top v > 0$.

TODO (16), (17): I'm not sure if Theorem 2.2 or 2.3 is used. The example might have implicit simplifications in it, by $\bar x$ existing in he first place.

Exercise 3.2 (2.7 on p.9)

Setup $\star$: Boundedness of $K$ is important in above result, otherwise strict separation is impossible. As an example, consider $K={[t,0]^\top \in \mathbb{R}^2: t\ge 0}$ and $C={[x,y]^\top \in \mathbb{R}^2: y\ge \tfrac{1}{x},\ x>0}$.

Show

(a) $K$ and $C$ are closed and convex.

(b) $C\cap K=\varnothing$.

(d) there is no hyperplane strictly separating $K$ and $C$.

Solution 3.2 (2.7 on p.9)

The figures are:

$C$ positive x-axis.
$K$ volume over the boundary $b_y(x) := \tfrac{1}{x}$.

(a)

disjunct ... clear
$K$ convex ... clear
$C$ convex ... clear with $b_y''(x) > 0$. Can also be shouwn with the line-stays-in-volume characterization.
No intersection, since $y$ is never zero for points in $C$
Closed ... due to limits in it. TODO: A proof atm missing
Plane would have to be horizontal, but $b_y$ calls under any distance from $K$

Exercise 3.3 (2.8 on p.9)

Refering to the sets in Setup $\star$, Show that for any $x\in K$, the closest point $z=P_C(x)$ has $P_K(z)\ne x$.

Solution 3.3 (2.8 on p.9)

Note: The statement here is that $P_K\circ P_C$ has no fixed-point.

$P_C(x)-x$ points updates and has positive x-component. $P_K(z)-z$ points downwards has zero x-component. Chaining them has positive component. Here, we can even compute it explicitly: The down-projection is basically coordinate projection, the up-projection points (see from down-point) up along the normal.

Exercise 3.4 (2.9 on p.9)

Refering to the sets in Setup $\star$, Is there a weakly separating hyperplane in the sense that there exists $(v,\gamma)\in \mathbb{R}^3\setminus{[0,0,0]^\top}$ such that $v^\top x \ge \gamma \ge v^\top z$ for all $z\in C$ and all $x\in K$?

Solution 3.4 (2.9 on p.9)

Approach: Existence of it would imply positive distance?? TODO

Note: Weak existence does exist for some broad class of conv spaces, even when they aren't bounded. (TODO: Validate this.)

Exercise 2.12*

(Note: This is not the same as 2.12 in the script.)

Justify the w.l.o.g. assumption in the proof of Theorem 2.8, namely that the zero vector $0$ lies in the convex set $C$.

Solution 2.12*

Roughly (I took this from the Latina looking gal in class):

"The $T_a \colon \dots$

$C' := T_a(C) = ... \implies 0 = a-a \in C' | d(x-a, y-a) = d(x,y)$

obdA $O\in C$ ... write down/use formula for convexity with $\lambda$s."

Main hint: Translation invariance, indeed statements about bijections

Exercise 2.13*

(Note: This is not the same as 2.13 in the script, but in this case we need to reference it to a good extent)

Let $x_1,\ldots,x_n\in\mathbb{R}^n$ be linearly independent and define $\Delta_{(x_1,\ldots,x_n)} := { \sum_{i=1}^n \alpha_i x_i \ |\ \alpha_i\in[0,1],\ \sum_{i=1}^n \alpha_i \le 1 }$. Show that $\Delta^{\circ}{(x_1,\ldots,x_n)} = { \sum{i=1}^n \alpha_i x_i \ |\ \alpha_i>0,\ \sum_{i=1}^n \alpha_i < 1 }$, and, in particular, $\Delta^{\circ}_{(x_1,\ldots,x_n)}\neq\varnothing$.

Solution 2.13*

Roughly (I took this from the Latina looking gal in class):

Also, we use heavy reliance on 2.13.

"$L(V)$ ... oepn where $V$ ... ${(t_i) : |t_i| < \delta }\in \sum \alpha_i^* x_i \in \Delta^{\circ}$"

Main hint: Use properties of invertible linear maps, such that linear maps are continous (consider the pre-image sense of the term) One needs to take care/work out where linear independence is used.

Exercise 2.14*

(Note: This is not the same as 2.14 in the script, but in this case we need to reference it to a good extent)

Let $H\subset\mathbb{R}^n$ be a hyperplane. Show that for every subset $M\subset H$ one has $M^{\circ}=\varnothing$ and $\partial M = \partial,\overline{M}$.

Solution 2.14*

Roughly (I took this from the Latina looking gal in class):

"$C^{\circ}=\varnothing$ implies doesn't have n linear independent points.

${\bar C}\subset M$

$C^{\circ} = ({\bar C})^{\circ} =\varnothing$

$dim(aff C) \le n - 1$ "

Note: For non-empty C, we have $aff(C) = z + span(C-z)$

The relevant content starts at page 17 in the script. The chapter ends on page 25. Note that the exercises are different ones from the listing in the script.

Exercise's context=

We consider the general minimization problem $p^{\ast} := \inf_{x \in M \subseteq X} f(x),$ where $X \subseteq \mathbb{R}^n$, $f:X\to\mathbb{R}$ is continuous, and the feasible set $M$ is given by inequality and equality constraints in terms of continuous functions $(g_i){i=1,\dots,m}$ and $(h_j){j=1,\dots,q}$ (see lecture/notes).

The Lagrangian is defined as $L(x;u,v) := f(x) + \sum_{i=1}^m u_i, g_i(x) + \sum_{j=1}^q v_j, h_j(x),$ with $u \in \mathbb{R}^m_{+}$ and $v \in \mathbb{R}^q$.

The associated Lagrange dual function is $\Theta:\mathbb{R}^m_{+}\times \mathbb{R}^q \to \mathbb{R},$ $\Theta(u,v) := \inf_{x \in X} L(x;u,v),$ and the dual problem is $d^{\ast} := \sup_{u \in \mathbb{R}^m_{+},\ v \in \mathbb{R}^q} \Theta(u,v).$

Exercise 5.1 (3.a)

(i) Show that $L(x;u,v) \le f(x)$ for all $x \in M$, $u \in \mathbb{R}^m_{+}$, $v \in \mathbb{R}^q$. (ii) Show that $d^{\ast} \le p^{\ast}$.

Solution 5.1 (3.a)

(i)

$g_i \le 0$ $h_j = 0$ $u \in {\mathbb R}^m$ $v \in {\mathbb R}^q$

Let $x_0\in M$ $L(x_i; u, v) = \text{look at the signs of all products in the sum} \le f(x_0)$

(ii) Roughly, $p^*$ in in-between $L(x_i; u, v) \le f(x_0)$ and $\Theta(u,v)$ is on the left of that, regardless of the sup over $u,v$. One just needs to properly keep track of which $M$ is being talked about.

Note: The proven inequality doesn't necessarily hold for $x_0\notin M$

Exercise 5.2 (3.b)

Let $X=\mathbb{R}^2$, $f(x)=-|x|^2$, and the feasible set $M={,x \in X \mid g(x)\le 0,}$ with $g(x)=|x|-1$.

(i) Determine the primal value $p^{\ast}=\inf_{x \in M} f(x)$. (ii) Determine the Lagrangian $L(x;u)=f(x)+u,g(x)$ for $u \in \mathbb{R}{+}$ and the dual function $\Theta(u)=\inf{x \in X} L(x;u)$. (iii) Show that $d^{\ast}=\sup_{u \ge 0}\Theta(u)=-\infty$ and hence that a duality gap is present.

Solution 5.2 (3.b)

Use $r_x := \lVert x\rVert > 0$ and for plots view it as the value on an x-axis.

Recall that norms have positive definiteness, $\lVert x\rVert = 0 \Leftrightarrow x = 0$.

(i)

$r\mapsto -r_x^2$ is $-1$ at $1$

(ii)

$L(x; u) = -r_x^2 + u\cdot r_x^2 - u$

(iii)

The whole non-positive half of the real is in the range of $L$ w.r.t. $x$. So we have $\inf_x L(x; u) = -\infty$

And $d^\ast$ here can't exceed this $-\infty$.

Exercise 5.3 (3.c)

We change only the boundary formulation, but keep $X=\mathbb{R}^2$ and $f(x)=-|x|^2$. Let $M={,x \in X \mid \tilde g(x)\le 0,}$ with $\tilde g(x)=|x|^2-1$.

(i) Determine again the primal value $p^{\ast}=\inf_{x \in M} f(x)$. (ii) Determine the Lagrangian $\tilde L(x;u)=f(x)+u,\tilde g(x)$ for $u \in \mathbb{R}{+}$ and the dual function $\tilde \Theta(u)=\inf{x \in X} \tilde L(x;u)$. (iii) Determine $d^{\ast}=\sup_{u \ge 0}\tilde \Theta(u)$ and show that strong duality holds, i.e., $d^{\ast}=p^{\ast}$. Give an optimal multiplier $u^{\ast}$.

Solution 5.3 (3.d)

(i, ii, iii)

$d^\ast = -\infty < p^\ast$

$\Theta(u)$, the inf of $L$, is $-u$ for $u\ge 1$ and $-\infty$ for $0\le u< 1$.

$d^\ast = -1 = p^\ast$

Exercise 5.4 (3.d)

Visualize, for both examples (1.b and 1.c), the Lagrangian functions for various values of the Lagrange multiplier $u$, in particular for the (if it exists) optimal parameter $u^{\ast}$.

Hint: Reduce the problem to a one-dimensional analysis.

Solution 5.4 (3.d)

Get two curves for the Lagrange potentials, and then some Kurvenschaars.

(1.b) We get something like $-r^2$ as well as some pudding forms below the x-axis, which is also $f$

(1.c) Flat for $u=1$ and else some some intersection (in the center, $L$ is below the other curve).

Exercise 6.1 (3.e)

Show: For every convex function $h:\mathbb{R}^d\to\mathbb{R}$ there exist an index set $I$, real numbers $(\alpha_i){i\in I}\subset\mathbb{R}$, and vectors $(b_i){i\in I}\subset\mathbb{R}^d$ such that $h(x)=\sup_{i\in I}(\alpha_i+\langle b_i,x\rangle)$ for all $x\in\mathbb{R}^d$, i.e., $h$ is the pointwise supremum of affine functionals. (This is the converse statement to Theorem 3.1 from the lecture.)

Also present the connection to convex sets and supporting hyperplanes.

Hint: A function $h:\mathbb{R}^d\to\mathbb{R}$ is convex iff its epigraph $\mathrm{epi}(h):={(x,t)\in\mathbb{R}^d\times\mathbb{R}\mid t\ge h(x)}$ is a convex set. Use supporting hyperplanes of the epigraph in $\mathbb{R}^{d+1}$ to obtain the representation as a supremum of affine functionals. Refer explicitly to the separation theorems and to the geometry of the epigraph.

Solution 6.1 (3.e)

epi(h) := {(r, x) in R x R^d | h(x) <= r}

forall x in R^d z_x := (h(x), x) via: approx Rand mit (h(x)-1/n, x) in R^d\epi(h)

Nach satz in 2. Kapitel existiert eine Hauptebene: H_{V_x, g_x} mit forall z in epi(h). < v_x, z_x > = g_x <= < v_x, z >

Definiere die affine Funktion l_x : R^d -> r turns out to be l_x(y) := g_x + < b_x, y > which has

forall y in R^d. l_x(y) <= h(y) l_x(x) = h(x)

und somit folgt h(x) = sup_{y in R^d} l_y(x)

Exercise 6.2 (3.2)

Prove Theorems 3.3 and 3.4 from the script.

Solution 6.2 (3.2)

tbd.

Exercise 7.0 (Setup)

Let $f, g_1, \ldots, g_m, h_1, \ldots, h_q : X \to \mathbb R$ ($X \subseteq \mathbb R^n$) be continuously differentiable and $M := { x \in X \mid g_i(x) \le 0 \ \forall i = 1, \ldots, m,\ h_j(x) = 0 \ \forall j = 1, \ldots, q }$.

Definition. Given the Lagrange function

$L(x,u,v) := f(x) + \langle u, g(x) \rangle + \langle v, h(x) \rangle$,

we call $(x,(u,v)) \in X \times (\mathbb R^m_+ \times \mathbb R^q)$ a KKT pair if:

(a) $\nabla_x L(x;u,v) = 0$; (b) $\nabla_u L(x;u,v) \le 0$ and $\langle u, \nabla_u L(x;u,v) \rangle = 0$; (c) $\nabla_v L(x;u,v) = 0$.

[[https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions]]

Theorem 3.10*

Let $x^{\ast} \in M$ be a local feasible minimizer of $f$, that is, there exists an open set $O \subset X$ with $f(x^{\ast}) = \min_{x \in O \cap M} f(x)$.

Furthermore, suppose that at $x^{\ast}$ the constraint qualifications (CQ) are satisfied, that is, the vectors

${ \nabla g_i(x^{\ast}) \mid i = 1, \ldots, m,\ g_i(x^{\ast}) = 0 } \cup { \nabla h_j(x^{\ast}) \mid j = 1, \ldots, q }$

are linearly independent. Then there exist Lagrange multipliers $(u^{\ast}, v^{\ast}) \in \mathbb R^m_+ \times \mathbb R^q$ such that $(x^{\ast},(u^{\ast},v^{\ast}))$ is a KKT pair.

In the first exercise we prove pieces of that theorem.

Exercise 7.1 (3.f)

Note: This exercise is the proof of Theorem 3.10* from the script.

Let $x^{\ast} \in M$ be a local feasible minimizer for which CQ holds. After renumbering the constraints we assume $g_i(x^{\ast}) = 0$ for $i = 1,\ldots,m^{\ast}$ and $g_i(x^{\ast}) < 0$ for $i = m^{\ast}+1,\ldots,m$, and define the matrices

$G = (\nabla g_1(x^{\ast}),\ldots,\nabla g_{m^{\ast}}(x^{\ast}))^{T} \in \mathbb R^{m^{\ast} \times n},$ $H = (\nabla h_1(x^{\ast}),\ldots,\nabla h_q(x^{\ast}))^{T} \in \mathbb R^{q \times n}.$

Show:

The inequality system $\langle \nabla f(x^{\ast}), z \rangle < 0,$ $Gz \le 0,$ $Hz = 0,$ $z \in \mathbb R^n$ has no solution.

Hint: You only need to use the optimality of $x^{\ast}$ and CQ.

There exist $(u^{\ast}, v^{\ast}) \in \mathbb R_+^{m} \times \mathbb R^{q}$ such that $\nabla_x L(x^{\ast};u^{\ast},v^{\ast}) = 0$.

Hint: Represent the equality $Hx = 0$ as two inequalities and use a result from the lecture.

Solution 7.1 (3.f)

Part 1

Note: The split of $g_i\le 0$ into $g_j<0, g_k=0$ amounts to a re-ordering and logical rewriting from disjunctions (of formulas) to existential statements (quantifying an index bound).

Interpretation of the 3 formulas

The rewriting means that the matrix $G$ is only concerned with active (see note in course notes) $g$-constraints. The $h$ are active anyway. So, firstly, if no constraints are active, then at the minimum in the open set we got $\nabla f(x^{\ast})=0$ and so the first equality is violated. (See also Theorem 4.1 in the lectures.)

Likewise $z=0$ is not a solution either. So it's ruled out that there are so many $h_j$'s with independent tangets such that ${0}$. Assume $q<n$ below.

If $z$ IS a solution, then $c, z$ for positive scalar $c$ is a solution too, so w.l.o.g. we can assume $|z| = 1$. So (with $\nabla f\cdot z < 0$) this is about no (always in the first-order sense) decrease of $f$ while feasibility ($g, h$) is not broken: In any direction along the constraint surface with $h=0$ (as $\nabla h \cdot z = H\cdot z = 0$, also just a first-order relation), and similarly (but more relaxed) for the active $g$. The more relaxed $g$-equation is related to the "feasible cone".

Argument

$H$ is just the jacobian $Dh(x^)$ of $h$ at $x^$. Under GC, the implicit function theorem holds, giving us a manifold along which $h$ is constant. We really just need a $C^1((-\varepsilon, \varepsilon), {\mathbb R})$ curve $\gamma$ with $\gamma(0)=x^*$ and $z=\gamma'(0)$ is legal, and so, by chain rule, $\nabla f\cdot z = \frac{d}{dt}|_{t=0} f(\gamma(t))$. If thats's $<$, we have re-minimized the point - contradiction.

Regarding the interaction of the implicit functions theorem with our constraints $g, h$: Here one must understand $O(), o()$ notation. Regarding the (non-binding) $g_i$ with slack ($i = m^*+1, \dots, m$):

For $\epsilon$ small enough $g_i(x^* + \epsilon, z) = g_i(x^*) + O(\epsilon) < 0$
For the binding one ($g_i(x^)=0$ sharp) we got a Taylor expansion to first order, also $g_i(x^ + \epsilon, z) = \langle \nabla g_i(x^), z \rangle + o(\epsilon)$ This is $< 0$ IF $Gz = \langle \nabla g_i(x^), z \rangle < 0$ (strict inequality) If $Gz = 0$ (i.e. $z$ actually along the tanget plane to first order), then we use those $\nabla g_i$ also in the implicit function theorem. All $g_k$ which are sharp AND have gradient normal to said constraint must be incoorporated into the implicit function construction.

Part 2

The claim $\nabla_x L = 0$ at $x^*$ for some $u, v$ exactly says that the gradient of $f$ is here linearly dependent of the gradients of $h, g$.

Farkas' Lemma: Let $A\in {\mathbb R}^{m\times n}, b\in {\mathbb R}^{m}, x\in {\mathbb R}^{n}, y\in {\mathbb R}^{m}$ Then one of the following always has a solution:

$x\ge 0\land A x = b$
$b^T y > 0\land A^T y = 0$

Write $Hz = 0$ as $Hz\le 0 \le Hz$, i.e. $Hz\le 0\land -Hz\le 0$. Set $b=-\nabla f, z=y$ and use a 1-dim matrix $A$ made from $G$ and $H$. Then $x$ gives our $u^$ and $v^$.

The other multiplicators can be set $0$ and the constraints are validated.

Exercise 7.2 (3.g)

We consider the optimization problem

$f(x_1,x_2) := (x_1 - 3)^2 + (x_2 - 3)^2$

with the inequality constraints

$g_1(x_1,x_2) := x_1^2 + x_2^2 - 50 \le 0,$ $g_2(x_1,x_2) := x_1 - 1 \le 0.$

Find all KKT pairs.
Are the Constraint Qualifications satisfied at these points?
Are these local/global minimizers?

Solution 7.2 (3.g)

Note: The last question "- Are these local/global minimizers?" is to be interpreted with constraints $g_i\le 0$ taken into account.

Note: I've made a note about the checking of active/inactive constraint in a section of lecture 7 in the course notes.

Find all KKT pairs.

$L = L(x_1,x_2,u_1,u_2) := (x_1 - 3)^2 + (x_2 - 3)^2 + u_1(x_1^2 + x_2^2 - 50) + u_2(x_1 - 1).$

$\partial_{x_i} L = 2(x_i - 3) + 2u_1, x_i + \delta_{i1},u_2,$

$\partial_{u_1} L = x_1^2 + x_2^2 - 50,$

$\partial_{u_2} L = x_1 - 1$

Note: $50 = R^2$ with $ R = \sqrt{2) 5$ or $7.071$.

Cases to go through:

$g_1(x) < 0$, $u_1 = 0$; $g_2(x) < 0$, $u_2 = 0$
$g_1(x) = 0$, $u_1 \ge 0$; $g_2(x) < 0$, $u_2 = 0$
$g_1(x) < 0$, $u_1 = 0$; $g_2(x) = 0$, $u_2 \ge 0$
$g_1(x) = 0$, $u_1 \ge 0$; $g_2(x) = 0$, $u_2 \ge 0$

Case 1 $g_1(x) < 0$, $u_1 = 0$; $g_2(x) < 0$, $u_2 = 0$

$\partial_{x_i} L = 2(x_i - 3) = 0$

So ${\vec x}=3,\langle 1, 1\rangle$, the global minimum of $f$. But this validates both $g_i\le 0$

Case 2 $g_1(x) = 0$, $u_1 \ge 0$; $g_2(x) < 0$, $u_2 = 0$.

$\partial_{x_i} L = 2(x_i - 3) + 2u_1, x_i$ $x_1^2 + x_2^2 = 50$.

So $x_1 = x_2 = \frac{3}{1 + u_1}$

With the circle equation, $\frac{3}{1+u_1} = \pm 5$ which forces $u_1<0$, so this isn't a KKT pair.

Case 3 $g_1(x) < 0$, $u_1 = 0$; $g_2(x) = 0$, $u_2 \ge 0$.

$\partial_{x_i} L = 2(x_i - 3) + \delta_{i1},u_2,$ $x_1 = 1$.

From $\partial_{x_2}L = 0$ we get $x_2 = 3$, fixing ${\vec x} = \langle 1, 3\rangle$. From $\partial_{x_1}L = 0$ we thus get $u_2 = 4$, fixing ${\vec u} = \langle 0, 4\rangle$.

Our point has $|{\vec x}| = 1$, so $g_1({\vec x}) = -40 < 0$.

So $\langle x_1,x_2,u_1,u_2\rangle = \langle 1,3,0,4\rangle$ is valid!

Case 4 $g_1(x) = 0$, $u_1 \ge 0$; $g_2(x) = 0$, $u_2 \ge 0$.

This gives the original $\partial_{x_i} L$ without setting zero of any $u_i$.

$x_1 = 1$ $x_1^2 + x_2^2 = 50$ is solved by ${\vec x} = \langle 1, \pm 7\rangle$. But this in $\partial_{x_2}L = 0$ gives negative $u_1$, so invalid.

In conclusion, Case 3 is the only valid KKT pair.

Are the Constraint Qualifications satisfied at these points?

For GC, we only need to check the binding constraints, so $g_2$ here. A single vector is always linearly independent.

In the exercise, actually the gradients are linear indepdent as long as $x_2$ isn't zero.

$\nabla g_1(x_1,x_2) = \langle 2x_1,; 2x_2 \rangle^\top$

$\nabla g_2(x_1,x_2) = \langle 1,; 0 \rangle^\top$

Are these local/global minimizers?

W.r.t. $f$, $x_1=1$ is always better than $x_1 < 0$. The function $f(x_1=1, x_2)$ is quadratic in $x_2$, minimized by our point.

Different argument: Here $f$ is convex, as is the feasible set $M$ (given by the volumes $g_i\le 0$). I.e. all ingredients here are convex. For a single valid KKT pair, this means it must be global.

Exercise 8.0 (Theorems from the script)

Theorem 4.1 (a). If $\sigma:\mathbb{R}\to\mathbb{R}$ is convex, then the difference quotients $\dfrac{\sigma(t)-\sigma(0)}{t}$ are nondecreasing for $t>0$.

Theorem 4.1 (b). Suppose $\psi:\mathbb{R}^d\to\mathbb{R}$ is a smooth convex function. Then any critical point $x^{\ast}$ (i.e., $\nabla\psi(x^{\ast})=\mathbf 0$) is a global minimizer of $\psi$: for all $x\in\mathbb{R}^d$, $\psi(x^{\ast})\le \psi(x)$.

For subgradients, see [[https://en.wikipedia.org/wiki/Subderivative]]

A helping lemma:

if $f:{\mathbb R}\to{\mathbb R}$ is convex and non-decreasting, and $g$ is convex, then $f\circ g$ is convex.
if $f:{\mathbb R}\to{\mathbb R}$ is convex and non-increasing, and $g$ is concave, then $f\circ g$ is convex.
if $f:{\mathbb R}\to{\mathbb R}$ is convex, and $g$ is affine linear, then $f\circ g$ is convex.

Another helping lemma: $a \le b \le a \iff a=b$ and so also $\left(\forall c. c,a \le c,b\right) \iff a=b$

Exercise 8.1 (4.a)

Let $\psi:\mathbb{R}^d\to\mathbb{R}$ be convex (not necessarily smooth).

Using Theorem 4.1(a), show that for all $x\in\mathbb{R}^d$ and any direction $d\in\mathbb{R}^d\setminus{0}$ the limit $\psi'd(x):=\lim{t\downarrow 0}\dfrac{\psi(x+t,d)-\psi(x)}{t}=\inf_{t>0}\dfrac{\psi(x+t,d)-\psi(x)}{t}$ exists.

The value $\psi'_d(x)$ is called the (one-sided) directional derivative of $\psi$ at $x$ in direction $d$.

Solution 8.1 (4.a)

Fix $x\in\mathbb{R}^d$ and $d\ne 0$. Define the one-dimensional slice $\sigma:[0,\infty)\to\mathbb{R}$ by $\sigma(t):=\psi(x+t d)$.

Since $\psi$ is convex and $t\mapsto x+t d$ is affine, $\sigma$ is convex on $[0,\infty)$.

The difference quotient $\dfrac{\psi(x+t d)-\psi(x)}{t}$ is thus of the form $=\dfrac{\sigma(t)-\sigma(0)}{t}$, i.e. as in theorem 4.1(a), and nondecreasing for $t>0$.

By the standard real-analysis fact that a monotone function has a one-sided limit at every point, the limit $\lim_{t\downarrow 0}$ of it exists (possibly $+\infty$) and equals the inf. [[https://en.wikipedia.org/wiki/Monotonic_function#Some_basic_applications_and_results]]

Note: E.g. non-convex function like $\sin(\tfrac{1}{x})$ violates the fact.

Exercise 8.2 (4.b)

Definition Let $\psi:\mathbb{R}^d\to\mathbb{R}$ be convex and let $\psi'_d(x)$ denote the (one-sided) directional derivative of $\psi$ at $x$ in direction $d$, as in Exercise 4.a.

A vector $g\in\mathbb{R}^d$ is called a subgradient of $\psi$ at $x$ if $\langle g,d\rangle\le \psi'_d(x)$ for all $d\in\mathbb{R}^d$.

The set of all subgradients of $\psi$ at $x$ is the subdifferential of $\psi$ at $x$, denoted $\partial\psi(x):={,g\in\mathbb{R}^d \mid \langle g,d\rangle\le \psi'_d(x)\text{ for all }d\in\mathbb{R}^d\setminus{0},}$.

Illustrate the definition of the subgradient using the example $\psi(x)=|x|$.

Furthermore, show that if $\psi$ is continuously differentiable, then for all $x\in\mathbb{R}^d$ we have $\partial\psi(x)={\nabla\psi(x)}$.

Solution 8.2 (4.b)

Solution 8.2.a (4.b.a)

Example $\psi(x)=|x|$ on $\mathbb{R}$. Draw a graph, with another axis $t$ below the x-axis. Consider that different $d$ just scales how fast you move through this.

For $x>0$, one has $|x+t d|-|x|=(x+t d)-x=t d$, hence Note: To compute the derivative, we can restrict ourselves to $t$ so small that $x+t d$ lies in a ball wholy on the positive side. $\psi'd(x)=\lim{t\downarrow 0}\dfrac{|x+t d|-|x|}{t}=d$. The subgradient inequality $\langle g,d\rangle\le \psi'_d(x)=d$ for all $d\in\mathbb{R}$. I.e. $g\cdot d \le d$ for both positive and negative $d$. But then $1\le g\cdot 1 \le 1$. This forces $g=1$, so $\partial|x|(x)={1}$. (The same kind of squeezing argument will be used in Solution 8.2.b (4.b.a))

For $x<0$, similarly $|x+t d|-|x|=-(x+t d)+x=-t d$, hence $\psi'_d(x)=-d$ and the inequality forces $g=-1$, so $\partial|x|(x)={-1}$.

At $x=0$, for $t>0$ we have $|t d|-|0|=t|d|$, so $\psi'_d(0)=|d|$. The condition $\langle g,d\rangle\le |d|$ for all $d\in\mathbb{R}$ holds iff $g\in[-1,1]$. Thus $\partial|x|(0)=[-1,1]$.

At $x=0$, those values $h\in [-1,1]$ can be visualizes as the selection of vectors (going left or right by $1$ and there up/down by $h$.) More generally, for the function $|{\vec x}|$ with ${\vec x}\in{\mathbb R}^n$, we have $\partial|x|({\vec 0})$ is a unit disc.

Note: For proper, lower-semicontinuous convex $f,g$, the set $\partial(f+g)(x)$ is a subset of the set of all pairwise additions of vectors from the sets $\partial f(x)$ together with those of $\partial g(x)$. With more regularity conditions, we even get equality.

Solution 8.2.b (4.b.a)

Now suppose $\psi$ is continuously differentiable on $\mathbb{R}^d$. Then even Taylor expansions of $\psi(x+t,d)$ exists. In particular, by the directional chain rule from elementary analysis, $\psi'_d(x)=\langle\nabla\psi(x),d\rangle$ for all $x,d$.

If $g\in\partial\psi(x)$, then for all $d$, $\langle g,d\rangle\le \langle\nabla\psi(x),d\rangle$. Applying the same with $-d$ gives $-\langle g,d\rangle\le -\langle\nabla\psi(x),d\rangle$, hence equality for all $d$, so $g=\nabla\psi(x)$. Therefore $\partial\psi(x)={\nabla\psi(x)}$.

Exercise 8.3 (4.c)

Exercise 8.3.a (4.c)

Let $\psi:\mathbb{R}^d\to\mathbb{R}$ be convex and assume that $x^{\ast}\in\mathbb{R}^d$ satisfies $\mathbf{0}\in\partial\psi(x^{\ast})$.

Show that $x^{\ast}$ is a global minimizer of $\psi$ over $\mathbb{R}^d$.

Exercise 8.3.b (4.c)

For constrained minimization problems: Does the necessary Karush–Kuhn–Tucker criterion generalize to the subdifferential setting?

Solution 8.3 (4.c)

Solution 8.3.a (4.c)

For any $x\in\mathbb{R}^d$ we let $d:=x-x^{\ast}$ and so $\sigma(t):=\psi(x^{\ast}+t (x-x^{\ast}))$ for $t\ge 0$. We just need to show $\psi(x) = \sigma(1)\ge \sigma(0) = \psi(x^{\ast})$.

As per Theorem 4.1(a), the one-sided derivative exists and equals its inf, $\psi'_d(x^{\ast})$. The assumption $\mathbf{0}\in\partial\psi(x^{\ast})$ means $0 = \langle \mathbf{0},d\rangle\le \psi'_d(x^{\ast})$. The convex function can't fall below the line with that slope. (Draw a picture and a line.)

Solution 8.3.b (4.c)

Yes, there's at least a StackExchange question [[https://math.stackexchange.com/questions/3680509/how-to-use-the-kkt-conditions-for-a-not-differentiable-function-using-subdiffere]]

From ChatGPT: in the convex setting, the KKT stationarity condition extends by replacing gradients with subdifferentials and adding the normal cone of the feasible set. Concretely, for inequality constraints $g_i$ and affine equalities $A x=b$, one has $0\in \partial\psi(x^{\ast})+\sum_i \lambda_i^{\ast}\partial g_i(x^{\ast})+A^\top \nu^{\ast}+N_{X}(x^{\ast})$, together with feasibility, $\lambda^{\ast}\ge 0$, and complementary slackness. Under a Slater-type constraint qualification, these subdifferential KKT conditions are necessary and sufficient for optimality in convex problems.

Exercise 9.1 (4.d)

Least squares regression is given by the minimization problem $\min_{\beta\in\mathbb{R}^n} S(\beta)$ $S(\beta):=\sum_{i=1}^{M}(\langle \beta,{\bf x}_i\rangle - y_i)^2,$ where ${\bf x}_1,\dots,{\bf x}_M\in\mathbb{R}^n$ and $y_1,\dots, y_M\in\mathbb{R}$ are given observations.

Formulate this problem in matrix form as an unconstrained quadratic optimization problem, as introduced in the lecture. Justify why the problem is convex, why it always has a (possibly nonunique) solution, and give an explicit representation of a solution.

Solution 9.1 (4.d)

The formulation should look like

$S(\beta) = \tfrac{1}{2} {\bf \beta}^T {\bf Q} {\bf \beta} + c^T {\bf \beta} + d = \left(\tfrac{1}{2} {\bf \beta}^T {\bf Q} + c^T\right) {\bf \beta} + d$

For the example, we denote the $i$'s data vector by ${\bf x}_i$. Per ${\bf x}_i$, its components can be of mixed data-units (dimension). But that sequence of data-units is the same for each $i$. The standard notation for the design matrix has the ${\bf x}_i$ as $M$ row vectors, ${\bf X} = \langle{\bf x}_1, \dots, {\bf x}_M\rangle^T$. So all rows of ${\bf X}^T = \langle{\bf x}_1, \dots, {\bf x}_M\rangle$ are comprised of only on data-unit. All $M$ values $y_i$ share data-units and so ${\bf X}^T y$ are $n$ vectors. We denote the corresponding $i$'s (scalar) response by $y_i$.

Here now, $d=y^T y$. Our matrix is then a weighted sum of rank-1 symmetric projector

${\bf Q} = 2 \sum_{i=1}^{M} {\bf x}_i \otimes {\bf x}_i^T = 2, {\bf X}^T {\bf X} := 2, G$

(here $G$ is the Gram matrix for the $x$'s.)

${\bf Q} = 2 \sum_{i=1}^{M} |{\bf x}i|^2 P_i$ with $P_i := u{x_i} \otimes u_{x_i}$

(The $P_i$ are sometimes called "orthogonal" projector, due to the symmetry, but the $u_{x_i}$ aren't necessarily mutually orthogonal.)

${\bf c}^T = -2\sum_{i=1}^{M} y_i, {\bf x}_i^T = -2{\bf X}y$

If ${\bf x}i^T {\bf Q} {\bf x}i$ span the space $\mathbb{R}^n$, we can use $P_i={\bf e}i \otimes {\bf e}i^T$ with $\sum{i=1}^{n} P_i={\mathsf 1}{n\times n}$ and compute its expansion ${\bf Q}=\sum{i=1}^{n}\sum{j=1}^{n}P_i{\bf Q}P_j$ and it will have quadratic coefficients, i.e. be positive semidefinite. TODO: State exact statement about ranks.

TODO: Explicate why anti-symmetric part of the form isn't relevant for our cost function here.

[[https://en.wikipedia.org/wiki/Matrix_calculus#Identities]] $Q_\text{sym} := \tfrac{1}{2}(Q + Q^\top)$ ${\mathrm d}S = \left(Q_\text{sym}\beta + c\right)^T{\mathrm d}\beta =: (\nabla_\beta S)^T,{\mathrm d}\beta$ A stationary point $\beta^*$ has $\nabla_\beta S = 0$. Here that's ${\bf X}^T({\bf X}\beta - y) = 0$.

Reminder: If a function is twice differentiable on a convex open set, then it's convex iff it's Hessian is postive semidefinite everywhere on that set. (Note: There's also notions of "$m$-strongly convex" and non-differentiable variants of such characterizations involving subgradients.)

It being convex and bounded form below, the optimization problem has a solution.

We further got $\nabla^2 S = Q$. And here, $\nabla^2 S$ is symmetric and so, famously, the solution is $X^+ y$ where if ${\bf X}^T{\bf X}$ is inversible we have $({\bf X}^T{\bf X})^{-1}{\bf X}^T$.

For completeness, $\dfrac{\partial}{\partial \beta_k}S(\beta) = 2\sum_{i=1}^{M}(\langle \beta, {\bf x}_i\rangle - y_i) ({\bf x}_i)k$ $\dfrac{\partial}{\partial \beta_j}\dfrac{\partial}{\partial \beta_k}S(\beta) = 2\sum{i=1}^{M}({\bf x}_i)_j\cdot ({\bf x}_i)_k$

...

$S(\beta + \alpha) - S(\beta) = \alpha^\top(Q_\text{sym}\beta + c \tfrac{1}{2},Q \alpha)$ At the stationary point $\beta^\ast$, this gives $S(\beta + \alpha) - S(\beta) = \tfrac{1}{2}\alpha^T Q \alpha$. Here $ |X\alpha|^2$ So $\alpha$ in the kernel of $X$ gives another solution. Note: There's also an abstract argument regardign $X^Ty$ in the image of $X^TX$.)

Exercise 9.2 (4.e)

Ridge regression provides a regularized version of least squares in strongly correlated settings. It is given by the minimization problem $\min_{\beta\in\mathbb{R}^n}(S(\beta)+\lambda|\beta|^2), \tag{1}$ where $\lambda\ge 0$ is the regularization parameter and $|\cdot|$ denotes the Euclidean norm.

Formulate this problem as a quadratic optimization problem in matrix form.
Solve the minimization problem. Is the minimizer unique?

Now also consider the constrained optimization problem $\min_{\beta\in\mathbb{R}^n} S(\beta), \qquad |\beta|\le t, \tag{2}$ for a constraint parameter $t>0$.

Note: Here the lecturer actually intended to write $|\beta|^2$ but accidentally dropped the exponent $2$. The solution below solves the problem as stated, i.e. without the squaring, which makes it a bit more compilicated (but still simpler than the final.)

State the KKT conditions by computing the gradients explicitly. Are the KKT conditions sufficient for optimality here?

Finally, study the relation between the penalized problem (1) and the constrained problem (2).

Show that every KKT pair of (2) corresponds to a solution of (1) for a suitable choice of $\lambda$.
Show the converse: for any fixed $\lambda\ge 0$ one can choose a constraint parameter $t\ge 0$ such that every solution of (2) yields a solution of (1).
Assume that in (2) the constraint is active at the minimizer $\beta^{\ast}$, i.e. $|\beta^{\ast}|=t$. Use the KKT conditions to derive a relation between the constraint parameter $t$ and the associated Lagrange multiplier $\lambda^{\ast}$.

Solution 9.2 (4.e)

The symmetric quadratic form ${\bf Q}$ above is here generalized to ${\bf Q}+\lambda,D$ where $D$ is for example the identity. Note: Only "for example", due to regression generally using not just a linear but an affine-linear model, which is done by introducing constant $1$'s at the start of the data vectors. To obtain better translation-of-$y$ and bias properties, one then uses $D := \mathrm{diag}(0, {\mathsf 1}_{n\times n})$, so that the intercept "$\beta_0$" doesn't end up being penalized.

For $\lambda\neq 0$, the new form is also positive definite. Said differently, the kernel ${\bf Q}+D$ and $S(\beta + \alpha) - S(\beta)$ above is the zero vector.

The previous exercise gets generalized as follows: $S_\lambda(\beta) = \tfrac{1}{2},{\bf \beta}^T!({\bf Q}+2\lambda D){\bf \beta} + {\bf c}^T {\bf \beta} + d.$ ${\mathrm d}S_\lambda = (Q_{\text{sym}},\beta + {\bf c} + 2\lambda D,\beta)^T{\mathrm d}\beta =:(\nabla_\beta S_\lambda(\beta))^T{\mathrm d}\beta.$ $(Q_{\text{sym}}+2\lambda D),\beta^\ast + {\bf c} = 0.$ $({\bf X}^T{\bf X} + \lambda D),\beta^\ast = {\bf X}^T y.$

$S_\lambda(\beta + \alpha) - S_\lambda(\beta) = \alpha^T!(Q_{\text{sym}},\beta + {\bf c} + 2\lambda D,\beta)

\tfrac{1}{2}\alpha^T{\bf Q},\alpha + \lambda,\alpha^T D,\alpha.$ $S_\lambda(\beta^\ast + \alpha) - S_\lambda(\beta^\ast) = |{\bf X}\alpha|^2 + \lambda,\alpha^T D,\alpha.$ For $D$ the unit matrix, the last term is $|\alpha|^2$.

The $\lambda$-offset shifts the eigenvalues of ${\bf X}^T{\bf X} + \lambda D$ by $\lambda$. For positive lambda the inverse exists.

$\mathcal L(\beta,u)=S(\beta)+u(|\beta|-t)$

Complementary slackness: $u^{\ast}(|\beta^{\ast}|-t)=0$ Note that for those quadratic forms, when $u^{\ast}$ so that $|\beta^{\ast}|-t=0$, then (I think) at the stationary value $t$ always equals the norm of some vector for which the algebraic form is known.
For active $u>0$ we need $|\beta| = t$ and else (primal feasibility) $|\beta| \le t$
Stationarity: We already know $\nabla S(\beta) = Q_{\mathrm{sym}}\beta+c$ and the subgradient $\partial |\beta|$ equals $\frac{\beta}{|\beta|}$ for $\beta\neq 0$, a 1-dim disk for $\beta=0$. In other words: $\beta\neq 0$ means $\nabla S(\beta^\ast) = -u^\ast \frac{\beta}{|\beta|}$ while $\beta=0$ means (can be written as, as also given in the third exercise) $|\nabla S(\beta^\ast)|\le u^\ast$

So we get the equation $(X^T X + \frac{u^\ast}{|\beta^\ast|} {\mathsf 1}_{n\times n}) \beta^\ast = X^T y$ which can (actually always) be inverted. With $r^\ast := X\beta^\ast - y$ (warning: the residual is sometimes taken to be $y - X\beta^\ast$), this reads $X^T r^\ast = -\frac{u^\ast}{|\beta^\ast|} \beta^\ast$

For the interiour case $(|\beta| < t)\to u^\ast = 0$, this is just standard regression. For the boundary case $(|\beta| = t)\to u^\ast > 0$, this is like $\lambda = \frac{u^\ast}{t}$.

? I assume this is not automatically possible to be fulfilled - else we could choose $\beta$ in regression arbitrarily small. (Or maybe we can and this just gives a bad residual?)

Note: If the problem is $|\beta|^2\le t$ instead, the $1/t$ factor doesn't pop up.

Exercise 9.3 (4.f)

Lasso regression is usually formulated directly as a constrained optimization problem: $\min_{\beta\in\mathbb{R}^n}\ \sum_{i=1}^{M}(\langle \beta,x_i\rangle - y_i)^2 \quad \text{subject to} \quad |\beta|_1\le t,$ where $|\beta|_1:=|\beta_1|+\cdots+|\beta_n|$ for a constraint parameter $t>0$.

Are the KKT conditions sufficient for optimality here? Note that this is a nonsmooth problem.
Let $(\beta^{\ast},\lambda^{\ast})$ be a KKT pair. Denote by $\partial|x|$ the subdifferential of the absolute value for $x\in\mathbb{R}$.

Show that for each component $j=1,\dots,n$ there exists $s_j\in\partial|\beta_j^{\ast}|$ with $\partial_j S(\beta^{\ast})+\lambda^{\ast}s_j=0.$ Deduce the following coordinatewise conditions: $\beta_j^{\ast}\ne 0 \ \Rightarrow\ \partial_j S(\beta^{\ast})=-\lambda^{\ast},\mathrm{sign}(\beta_j^{\ast}),$ $\beta_j^{\ast}=0 \ \Rightarrow\ |\partial_j S(\beta^{\ast})|\le \lambda^{\ast}.$

Explain why these conditions suggest that many components of $\beta^{\ast}$ are exactly zero.

Solution 9.3 (4.g)

Description of the subgradient $\partial |\beta|_1$, the elements of which we'll call: The result is always (a collection of) vectors $s$ with as many components as $\beta$. If we evaluate the garient at a place where $\beta_k$ is positive, the component $s_i$ of the gradient is fixed to $+1$. If negative, then it's $-1$. Where a component of $\beta$ is zero, we get $s_k$ in the 1-dim unit disk, i.e. the interval $[-1,+1]$.

?Show that for each component... Isn't that just the stationarity condition?

Compared to the previous exercise, we now have $X^T r^\ast = - u^\ast \cdot s$

As for the last question, apprently $X^T r^\ast$ after recentering can be identified with $|M\cdot\mathrm{Cov}(x_i, r^\ast)|$. If $X^T r^\ast$ is smaller than $u^\ast$, then $\beta_i$ is forced to be zero.

(And the literature says something about "soft-thresholding" solution here.)

Exercise 10.1 (5.a)

We consider a convex minimization problem ($f, g_1, \dots, g_m$ convex, $h(x) = b - Ax$).
Denote by $L$ the associated Lagrange function.
Using the subdifferential, KKT pairs are defined as $(x, (u, v)) \in X \times (\mathbb R_+^m \times \mathbb R^q)$ with:

(a) $0 \in \partial_x L(x; u, v)$.
(b) $\nabla_u L(x; u, v) \le 0$ and $\langle u, \nabla_u L(x; u, v)\rangle = 0$.
(c) $\nabla_v L(x; u, v) = 0$.

Show, based on the results of the lecture and the last exercises:

For every KKT pair $(x^{\ast}, (u^{\ast}, v^{\ast}))$ the point $x^{\ast}$ is a minimizer of the primal problem.
If in addition the Slater condition holds, then for every minimizer $x^{\ast}$ there exists a pair $(u^{\ast}, v^{\ast})$ such that $(x^{\ast}, (u^{\ast}, v^{\ast}))$ is a KKT pair.
Moreover, $(u^{\ast}, v^{\ast})$ is a solution of the dual problem.

Solution 10.1 (5.a)

Here we have equality $f(x^\ast) = L(x^\ast, u^\ast, v^\ast) = \inf_{x\in{\mathbb R}^n} L(x, u^\ast, v^\ast)$

and with convex $L$.

We had the theorem (3.3): If $x^\ast, u^\ast, v^\ast$ is saddle point, then from convexity $p^\ast = f(x^\ast) = \Theta(u^\ast, v^\ast) = d^\ast$

Slater condition $\exists {\hat x}\in M. \forall (1\le i\le m). g_i({\hat x}) < 0$ If this point has $f(x^\ast) = p^\ast$ then with Thm. 4.7 that $(u^\ast v^\ast)\in {\mathbb R}+^m \times {\mathbb R}^q$ such that $p^\ast = f(x^\ast) = \Theta(u^\ast, v^\ast) = d^\ast$ implies $\inf{x\in{\mathbb R}^n} L(x, u^\ast, v^\ast)$

$p^\ast = f(x^\ast) = \Theta(u^\ast, v^\ast) = d^\ast$ also implies that the inner products in $L(x, u^\ast, v^\ast)$ contains zero. (complimentary slackness, a.k.a. "komplimentärer Schlupf")

$\nabla_x L(x^\ast, u^\ast, v^\ast) = 0$ implies $x^\ast, u^\ast, v^\ast$ is a KKT pair.

$x^\ast = \mathrm{argmin}_{x\in R^n} \Psi(x)$ implies difference quotient at $x^\ast$ is $\ge 0$ for all $h$ implies for $h\to 0$ we have $\Psi'(x^\ast) = 0$

$g$ in the subquotient is equivalent to $\langle g, d\rangle \le \Psi_d'(x)$ for all non-zero directions.

Exercise 10.2 (5.b)

We consider a linear minimization problem

$p^{\ast} = \inf_{Ax \le b,\ x \in \mathbb R^n} \langle c, x\rangle$,

where $A \in \mathbb R^{m \times n}$, $b \in \mathbb R^m$ and $c \in \mathbb R^n$.
Let $M := { x \in \mathbb R^n \mid Ax \le b }$ be the feasible set.

Explain why $M$ is a convex polyhedron in $\mathbb R^n$.
Show further that, if the minimum $p^{\ast}$ is attained, the set of optimal solutions is the intersection of $M$ with a supporting hyperplane.
Argue further that therefore a minimizer can always be found at a vertex of the polyhedron.
Formulate an equivalent optimization problem in the standard form of linear programming

$\min{ \tilde c^{\top} z \mid \tilde A z = \tilde b,\ z \ge 0 }$,

and specify the corresponding data $\tilde A$, $\tilde b$, $\tilde c$ in a suitable block notation.
Show in particular the equivalence of the two problems.

Write down the KKT conditions for the problem in standard form.
Let $(z^{\ast}, (\lambda^{\ast}, \mu^{\ast}))$ be a KKT pair for the problem in standard form and denote by $(\tilde a_i)_{i = 1, \dots, k}$ the columns of $\tilde A$ (where $k$ is the dimension of $z$).
Assume that the vectors

${ \tilde a_i \mid z_i^{\ast} > 0,\ i = 1, \dots, k }$

are linearly dependent.
Show that there then exists a shift $d \in \mathbb R^k \setminus {0}$ such that $(z^{\ast} + d, (\lambda^{\ast}, \mu^{\ast}))$ is also a KKT pair.

Using the above observation, justify why it is sufficient to find KKT pairs for which the vectors ${ \tilde a_i \mid z_i^{\ast} > 0,\ i = 1, \dots, k }$ are linearly independent (vertices).
Devise a "brute-force" algorithm for finding all KKT pairs at vertices.

Solution 10.2 (5.b)

TODO(Fix): In some cases below, I have inconsisteny with whether $\langle\rangle$ denote inner prodcut of matrices/vectors. Also, some tilde signs are written with promes.

Note: The exercise silently assumes the planes/subspactes are such that there#s a corner.

[[https://en.wikipedia.org/wiki/Polyhedron]] (Note: We don't demand compactness of a polyhedron, here.) The inequalities each define linear half-spaces. The intersection of such defines a polyhedron.

$x\in M\cap {x \in R^n \mid f(x) = p^} $ $= M\cap{x \in R^n \mid \langle c, x\rangle = p^} = H$

$c$ is normal to $H$.

Since $\langle c, y\rangle = f(y) \ge p^$ for all $y\in M$ also $\langle c, y\rangle \ge p^ = \langle c, x\rangle$ for all $y$ in $M$.

Not proven here: The solution is always in a corner. A corner is a point $x$ such that for all $\lambda\in [0,1]$ and distinct $y,z\in M$ $x=\lambda y + (1-\lambda) z$ implies $\lambda \in {0, 1}$. (Alt: For $\lambda\in (0,1)$, the equation $x=\lambda y + (1-\lambda) z$ is equivalent to $x=y=z$.)

For $M = { x\in R^n \mid Ax\le b }$ the definition is equivalrnt to: $x\in M$ is a corner if $I_x = {i ...}$ (TODO) has $M = { x\in R^n \mid (Ay)_i = b_i, i = 1,\dots, m } = {x}$

Sketch of the rest of the argument: For $x\in S\cap H$ and $x$ not a corner holds $(Ay)_i = b_i$ for all $i\in I_x$ has infinitely many solutions. (We think of solutions along the face $H$ of a polyhedron $M$ here.)

So there exists $d\in R^n$ with $(Ay + d)_i = b_i$ for all $i\in I_x$ $(Ay + d)_i < b_i$ for all $i\notin I_x$

And further $(Ad)_{j^} > 0$ for some $j^\notin I_x$. (Here the assumption comes in that we assume that a corner exists in the first place.)

We can then choose $d$ such that $A(x+d){j^*} = b{j^*}$.

So $I_{x+d} = I_x \cup {j^*}$

Continue the arugment towards a point such that $(Ax)_i = b_i, i\in I_x$ is a unique solution.

${x\in R^n \mid Ax\le b}$ $= {x\in R^n \mid Ax- b\le 0}$ $= {x\in R^n \mid \exists y\in {\mathbb R}+^m. Ax- b + y = 0}$ $= {x_1 - x_2 \in R^n \mid \exists y\in {\mathbb R}+^m. Ax- b + y = 0 \land x_1, x_2 \in R^n_+ }$ The equation in the comprehesnion equals one of stacked matrices and vectors, $( A, -A, I_m )^T ( x_1, x_2, y )$ Similarly, $\langle x, c \rangle = (c, c, 0)^T (x_1, x_2, 0)$

The the following problem is in standard form: $\mathrm{inf}{z\in R^{2n+m}+} (c, c, 0)^T z$ $A' z = b$

The problem is equivalent since $x\in M: z := (x^+, x^-, Ax-b)$ $x^\pm = \max(0, x_i), i=1,\dots, M$ and $\langle c,x\rangle = \langle c',z\rangle$

$z\in M' = R^{2n+m}\cap { z\in R^{2n+m}\mid A'z=b }$ and $z = (x_1, x_2, y)$ for we define $x=x_1-x_2$ and $\langle c,x\rangle = \langle c',z\rangle$

$L(z, (\lambda, \mu)) = \langle c', z\rangle - \langle \lambda, z\rangle - \langle \mu, Ax-b\rangle$

$\nabla_z L(z, (\lambda, \mu)) = c' - \lambda - A^T \mu$

KKZ-condition: $(z, \lambda, \mu) \in R^{2n+m}\times (R^n\times R^m)$ (i) $c' - \lambda - A^T\mu = 0$ (ii) $z \ge 0, A'z = b, \lambda \ge 0$ (iii) $\lambda_i z_i = 0$ for all $i = 1, \dots, 2n+m$

tbd

Exercise 10.3 (5.c)

Consider the model introduced in Exercise 5.b with

$A = \begin{pmatrix} -1 & 0 \ 0 & -1 \ 1 & 1 \end{pmatrix}$ $b = \begin{pmatrix} 0 \ 0 \ 3 \end{pmatrix}$

Apply the items from Exercise 5.b in each case to the problem with

(a) $c = (1, 1)^{\top}$.

(b) $c = (-1, -1)^{\top}$.

In both cases, describe the situation geometrically as far as possible.

Solution 10.3 (5.c)

tbd.

Notes: Applied Optimization

Tmp Notes

In (convex) optimization (finding infimums), roughly, the "dual" comes from it being helpful to look at "dual" problems - some other function we can construct, where we pose another optimization problem (finding supreme) - and then there's some theorems about how those solutions related (e.g. coincide, or bound eachother)
Wolfe duality (p. 29 in the script), see [[https://en.wikipedia.org/wiki/Wolfe_duality]]
- See also [[https://en.wikipedia.org/wiki/Frank%E2%80%93Wolfe_algorithm]]
Slater (p. 31 in the script) ... Next to convexity, "Slater" is a condition that very roughly says there's certain solution points existing and that those aren't on the boundary (<https://en.wikipedia.org/wiki/Slater%27s_condition]]). It has nice consequences w.r.t. duality.
- Note: This is a different Slater than from the Slater determinant in quantum chemistry
TTK ... Karush–Kuhn–Tucker conditions, see [[https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions]]
Note: The script ends up discussing applications of what's being discussed to SVMs [[https://en.wikipedia.org/wiki/Support_vector_machine]]

Lecture 1

Notes and important notions

Convexity (used on R^n)
Standard product on R^n
Product induced norm*, and
Norm induced distance between points
Norm induced distance between a point and a set
Cauchy-Schwarz inequality
Closed subset of R^n
Taking closure preserves convexity
Existence of inf between point and convex set M implies existence of min between two points (with the second in M). (For M not convex, only a sequence in M exists.)
EXERCISE: Prove $| x-y|^2 = | x|^2 + | y|^2 + 2 \langle x, y\rangle$
Minimal distance characterization of a point (for a convex closed subset) in terms of a scalar product
- Uniqueness of this point

Note: If you're in an infinite dimensional space, you can't generally use Boltzano Weierstrass and so existence (?) statements are hintered. What helps, as alternatives: E.g. Hilbert space.

Lecture 2

Notes and important notions

For the exercises after lecture 2, the following is relevant: The context is, again, a convex set $C\subset{\mathbb R}^n$. The exercises concern the projection map $P_C$, defined to map any point $x\in{\mathbb R}$ to the point ${\bar z}_x\in {\bar C}$ on its closure, that is closest to $x$. Theorem 2.1 already proved the existence of a minimal distance projection. Theorem 2.2 concerned uniqueness and angles (see previous exercise). Theorem 2.3 is a variant of Theorem 2.1, except now it's in terms of $\bar C$ and it speaks of the inf of the distances, not the min.

Lecture 3

Notes and important notions

In Theorem 2.6, the hyperplane actually goes through $P_C(x)$, meaning this $\gamma$ actually defines the extremal hyperplane.
Regarding Corollary 2.1, in Linear Algebra this is proven via an argument about a partition of space and theorems about their dimensions.
W.r.t.: (15): Compact plus continuity of dist(x, C) gives us that inf is attained (as minumum)
- TODO: be more clean about this statement
(around p. 10) The script misses the defintion of the interior
Note that the boundary also has a characterization in terms of limit (TODO: Look it up.)
W.r.t. Exercise 2.11: The boundary is all of $[0,1]$ while the boundary of the closure is just ${0,1}$.

Lecture 4

Notes and important notions

Interiour = largest open set that contains all open sets, [[https://en.wikipedia.org/wiki/Interior_%28topology%29]]
Theorem 2.8 rules out that certain situations happen for sets if they are are convex
- Something along the lines of "interior of bound is empty" being shown, or even an equivalence of notions (for convex sets)
The proof also shows ${\mathrm{dim}}({\mathrm{span}}(C)) < n$

Lecture 5

Notes and important notions

We look at theorems such as "Theorem 2.10 (Hyperplane supporting convex set at boundary point)" because we'll later mostly look at linear constraints in optimization problems.

Lecture 6

Notes and important notions

Lecture 7

Notes and important notions

Regarding Theorem 3.6, here an example to think of: $F(x,y) = x\cdot y$. Inf on $x$ makes this $-\infty$ and sup on $y$ gives $\infty$.

The "strong" in strong duality just says that the (non-negative) gap is zero.

In the KKT-conditions, there's a trick with the number of equations going on. In particular, since $g_i\le 0, u_i \ge 0, i\in{1,\dots, n}$, the equations $\sum_i u_i,g_i = {\vec u}\cdot \nabla_u L = 0$ are really the $n$ equations $u_i,g_i=0$.

active constraint: $u_i>0\to g_i=0$, we say $g_i$ is tight, and so $g_i(x)$ constrains values (motion) of $x$.
**strictly inactive ** constraint (converse of the above): $u_i=0$ because $g_i$ takes some value inside the blob, we say $g_i$ has slack, and as such doesn't cut into the open ball around considered $x$. (Note: Both zero is a possible but not so relevant edge case.)

General active–set scheme (KKT with inequality and equality constraints)

We consider a problem with decision variable $x \in \mathbb R^n$, inequality constraints $g_i(x) \le 0$ for $i = 1,\dots,m$ with multipliers $u_i \ge 0$, and equality constraints $h_j(x) = 0$ for $j = 1,\dots,q$ with multipliers $v_j \in \mathbb R$. The Lagrangian is $L(x,u,v) := f(x) + \sum_{i=1}^m u_i g_i(x) + \sum_{j=1}^q v_j h_j(x)$.

The KKT conditions for inequalities are, for each $i$, $g_i(x^{\ast}) \le 0$, $u_i^{\ast} \ge 0$, $u_i^{\ast} g_i(x^{\ast}) = 0$. From these we have the logical implications $g_i(x^{\ast}) < 0 \Rightarrow u_i^{\ast} = 0$, $u_i^{\ast} > 0 \Rightarrow g_i(x^{\ast}) = 0$. We call $g_i$ strictly inactive at $x^{\ast}$ if $g_i(x^{\ast}) < 0$, and active if $g_i(x^{\ast}) = 0$.

To construct KKT candidates via an active–set method, we choose a subset $A \subseteq {1,\dots,m}$ and interpret this choice as the hypothesis: for all $i \in A$ the constraint is active (so we will enforce $g_i(x) = 0$), for all $i \notin A$ the constraint is strictly inactive (so we expect $g_i(x) < 0$ and thus $u_i = 0$).

For a fixed $A$, we therefore seek $(x,u,v)$ satisfying: stationarity in $x$: $\nabla_x L(x,u,v) = 0$, primal equalities: $h_j(x) = 0$ for all $j = 1,\dots,q$, active inequalities as equalities: $g_i(x) = 0$ for all $i \in A$, dual feasibility: $u_i \ge 0$ for all $i$, and we impose $u_i = 0$ for all $i \notin A$ in accordance with the hypothesis “strictly inactive”.

After solving this system, we perform the logical KKT checks: for all $i \notin A$ verify $g_i(x) < 0$ (strict inactivity) and $u_i = 0$, for all $i \in A$ verify $g_i(x) = 0$, $u_i \ge 0$, and $u_i g_i(x) = 0$, for all $i$ verify $g_i(x) \le 0$ and for all $j$ verify $h_j(x) = 0$. Any $(x,u,v)$ that passes these checks is a KKT candidate corresponding to the active–set hypothesis $A$; different choices of $A$ give different candidates, whose objective values $f(x)$ can then be compared.

Note: In going through the cases to find solutions, one can relax some $g < 0$ to $g\le 0$, but this also checks for something else.

Lecture 8 (251119)

Warning: Lecturer acknolwedges that he's using "minimum" (formally a function value) to denote the global minimum position.

Regarding Exercise 3.6

The script doesn't say that there's actually a sufficient second-order condition also, here. (This is because the script is written for convex problems.)

For strict (unique) minimality, one demands positive definiteness of

$H(x^) := \left(\partial_{x_i} \partial_{x_j} L\right)(x^)$

And (more complicatedly) if semi-definiteness in an open patch (!), then one obtains a non-strict (non-unique) minimum. The "in an open patch" part isn't there in lower dimensions.

Regarding Chapter 4

"Convex functions" only makes sense on convex sense.

Lecture 9 (251127)

Regarding Theorem 4.2 with a form $Q$ and vector $c^T$, the OLS case is of this form: One has $Q\beta + c = 0\ \Longleftrightarrow\ X^\top X,\beta = X^\top y.$ to solve $\min_{\beta\in\mathbb{R}^p}\ \tfrac{1}{2},|y - X\beta|_2^2.$

Regarding Theorem 4.3, it seems the assumption that $h$ is of linear form is actually not necessary. It's (I think, and the lecturer conjectures this too) not used in that proof there at all.

Necessary conditions for KKT in convex settings: There's a weaker formulation than $\langle \nabla f d \rangle = 0$ to get the minimum, namely one using the subgradient: ${\vec 0}\in \partial f$. Visually speaking, the min doesn't need to be at a smooth point, it can also be a cusp there.

Lecture 10 (251203)

Recap Thm. 4.7 and 4.8. and final proof of chapter 4.

Lecture 11 (251203)

Note: Non-linear programs aren't discussed in the script.

Lecture 12 (251210)

Lecturer stepped through most of this week's exercise 10.

Nikolaj-K/applied_optimization_2025.md

Exercises - Applied Optimization

Exercise 1.1

Solution 1.1

Exercise 1.2

Solution 1.2

Exercise 2.2

Solution 2.2

Exercise 2.3

Solution 2.3

Exercise 2.4

Solution 2.4

Exercise 2.5

Solution 2.5

Exercise 3.1 (2.6 on p.9)

Solution 3.1 (2.6 on p.9)

Exercise 3.2 (2.7 on p.9)

Solution 3.2 (2.7 on p.9)

Exercise 3.3 (2.8 on p.9)

Solution 3.3 (2.8 on p.9)

Exercise 3.4 (2.9 on p.9)

Solution 3.4 (2.9 on p.9)

Exercise 2.12*

Solution 2.12*

Exercise 2.13*

Solution 2.13*

Exercise 2.14*

Solution 2.14*

Exercise's context=

Exercise 5.1 (3.a)

Solution 5.1 (3.a)

Exercise 5.2 (3.b)

Solution 5.2 (3.b)

Exercise 5.3 (3.c)

Solution 5.3 (3.d)

Exercise 5.4 (3.d)

Solution 5.4 (3.d)

Exercise 6.1 (3.e)

Solution 6.1 (3.e)

Exercise 6.2 (3.2)

Solution 6.2 (3.2)

Exercise 7.0 (Setup)

Theorem 3.10*

Exercise 7.1 (3.f)

Solution 7.1 (3.f)

Part 1

Interpretation of the 3 formulas

Argument

Part 2

Exercise 7.2 (3.g)

Solution 7.2 (3.g)

Exercise 8.0 (Theorems from the script)

Exercise 8.1 (4.a)

Solution 8.1 (4.a)

Exercise 8.2 (4.b)

Solution 8.2 (4.b)

Solution 8.2.a (4.b.a)

Solution 8.2.b (4.b.a)

Exercise 8.3 (4.c)

Exercise 8.3.a (4.c)

Exercise 8.3.b (4.c)

Solution 8.3 (4.c)

Solution 8.3.a (4.c)

Solution 8.3.b (4.c)

Exercise 9.1 (4.d)

Solution 9.1 (4.d)

Exercise 9.2 (4.e)

Solution 9.2 (4.e)

Exercise 9.3 (4.f)

Solution 9.3 (4.g)

Exercise 10.1 (5.a)

Solution 10.1 (5.a)

Exercise 10.2 (5.b)

Solution 10.2 (5.b)

Exercise 10.3 (5.c)

Solution 10.3 (5.c)

Notes: Applied Optimization

Tmp Notes

Lecture 1

Notes and important notions