Agustinus Kristiadi's Blog
http://wiseodd.github.io/
Sat, 23 May 2020 20:26:16 +0200
<h2 class="section-heading">De Rham Cohomology</h2>
<p>One of the central points of the previous two posts about <a href="/techblog/2020/03/14/covector-field/">covector fields</a> and <a href="/techblog/2020/04/17/exterior-derivative/">differential forms</a> is the identification of closed and exact forms. If $d$ is the exterior derivative, a smooth differential form $\omega$ is <strong><em>closed</em></strong> if $d\omega = 0$. Meanwhile, $\omega$ is <strong><em>exact</em></strong> if it can be written as $\omega = d\eta$. Since $d \circ d = 0$, every exact form is closed.</p>
<p>In Euclidean space, if $\omega$ is a covector field ($1$-form), the converse is also true: there we can equivalently think of a conservative vector field and see that it fulfills both conditions above. But what about higher-degree forms on an arbitrary smooth manifold? It turns out that the answer to this question depends on the topology of the manifold, as was suggested in Example 4 of this <a href="/techblog/2020/03/14/covector-field/">post</a>: on the punctured Euclidean space $\R^n \setminus \{ 0 \}$, a closed form is not necessarily exact, due to the existence of the “hole” in $\R^n$.</p>
<h2 class="section-heading">Quotient vector spaces</h2>
<p>Recall that if $W \subseteq V$ are vector spaces, we can define their quotient space as follows. We define an <strong><em>equivalence relation</em></strong> $\sim$ on $V$ by stating that two elements $v_1, v_2$ of $V$ are equivalent (i.e. $v_1 \sim v_2$) if their difference $v_1-v_2$ is in $W$. The <strong><em>equivalence class of an element $v \in V$</em></strong> is then $[v] := \{ v + w : w \in W \}$. In words, $[v]$ consists of all elements of $V$ that are equivalent to $v$, which can be written as $(v + w)$ with $w \in W$, since $v - (v + w) = w \in W$ and thus $v \sim (v + w)$. The <strong><em>quotient space</em></strong> $V/W$ is defined as the vector space consisting of all equivalence classes of $V$ w.r.t. $W$.</p>
<p>Note in particular that the quotient $V/V$ is the zero vector space $\{ [0] \}$. This is because any two elements $v_1, v_2$ of $V$ are equivalent, since $(v_1 - v_2) \in V$; thus for any $v \in V$ we have $[v] = \{ v + w : w \in V \} = V = [0]$.</p>
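The equivalence-relation picture can be made concrete with a small sketch (a hypothetical example, not from the post): take $V = \R^3$ and $W = \mathrm{span}\{e_3\}$, so two vectors are equivalent precisely when their first two coordinates agree.

```python
# Hypothetical illustration: V = R^3, W = { (0, 0, t) : t in R } = span{e3}.
# v1 ~ v2  iff  v1 - v2 lies in W.

def in_W(w):
    """Membership test for the subspace W = span{e3}."""
    return w[0] == 0 and w[1] == 0

def equivalent(v1, v2):
    """v1 ~ v2  iff  v1 - v2 lies in W."""
    return in_W(tuple(a - b for a, b in zip(v1, v2)))

def representative(v):
    """A canonical representative of the class [v] = v + W: zero out the W-part."""
    return (v[0], v[1], 0)

print(equivalent((1, 2, 5), (1, 2, -3)))   # True:  difference (0, 0, 8) is in W
print(equivalent((1, 2, 5), (0, 2, 5)))    # False: difference (1, 0, 0) is not in W
print(representative((1, 2, 5)))           # (1, 2, 0), same class as (1, 2, 5)
```

Adding any $w \in W$ to $v$ lands in the same class $[v]$, which is exactly the coset $v + W$.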
<h2 class="section-heading">The de Rham cohomology groups</h2>
<p>Let $M$ be a smooth manifold and let $p$ be a non-negative integer. Since the exterior derivative $d: \Omega^p(M) \to \Omega^{p+1}(M)$, mapping smooth $p$-forms to smooth $(p+1)$-forms, is linear, its <a href="https://math.libretexts.org/Bookshelves/Linear_Algebra/Book%3A_A_First_Course_in_Linear_Algebra_(Kuttler)/09%3A_Vector_Spaces/9.08%3A_The_Kernel_and_Image_of_a_Linear_Map">kernel and image are linear subspaces</a>. We define</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{Z}^p(M) &:= \mathrm{Ker} \left( d: \Omega^p(M) \to \Omega^{p+1}(M) \right) \\
&= \{ \omega \in \Omega^p(M): d\omega = 0 \} \\
&= \{ \text{closed $p$-forms on $M$} \} \, ,
\end{align} %]]></script>
<p>and</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{B}^p(M) &:= \mathrm{Im} \left( d: \Omega^{p-1}(M) \to \Omega^{p}(M) \right) \\
&= \{ d\omega \in \Omega^p(M) : \omega \in \Omega^{p-1}(M) \} \\
&= \{ \text{exact $p$-forms on $M$} \} \, .
\end{align} %]]></script>
<p>To handle the edge cases, we adopt the convention that $\Omega^p(M)$ is the zero vector space when $p < 0$ or $p > n := \mathrm{dim} \, M$. Therefore, in particular, $\mathcal{B}^0(M) = 0$ and $\mathcal{Z}^n(M) = \Omega^n(M)$. Furthermore, since every exact form is closed, we have $\mathcal{B}^p(M) \subseteq \mathcal{Z}^p(M)$.</p>
<p>It therefore makes sense to define the <strong><em>$p$-th de Rham cohomology group of $M$</em></strong> to be the quotient vector space</p>
<script type="math/tex; mode=display">H^p_\text{dR}(M) := \frac{\mathcal{Z}^p(M)}{\mathcal{B}^p(M)} = \{ [\omega] : \omega \in \mathcal{Z}^p(M) \} \, .</script>
<p>The element $[\omega]$ of $H^p_\text{dR}(M)$ is called the <strong><em>cohomology class of $\omega$</em></strong>, and if $[\omega] = [\omega']$, we say the two forms are <strong><em>cohomologous</em></strong>. Note that $H^p_\text{dR}(M) = 0$ for $p < 0$ or $p > n$, since $\Omega^p(M)$ is the zero vector space in those cases.</p>
<p>This construction might seem familiar if one has some exposure to algebraic topology, esp. simplicial or singular homology theory. Intuitively, one can substitute the space $\Omega^p(M)$ with a chain complex and the map $d$ with the boundary map to get the construction of simplicial/singular homology. The nomenclature <em><strong>co</strong>homology</em> here reflects the fact that we are studying elements of the “dual space” of a manifold.</p>
<p>Now, here is how the de Rham cohomology groups can help us answer our main question: for $0 \leq p \leq n$, the quotient space $H^p_\text{dR}(M)$ is the zero vector space if and only if every closed $p$-form on $M$ is exact. This is because, in that case, $\mathcal{Z}^p(M) = \mathcal{B}^p(M)$ and thus their quotient is trivial (i.e. zero).</p>
<p><strong>Example 1.</strong> The punctured plane $M := \R^2 \setminus \{ 0 \}$ in Example 4 of the <a href="/techblog/2020/03/14/covector-field/">previous post</a> has $H^1_\text{dR}(M) \neq 0$, since we have shown that there is a closed covector field ($1$-form) on $M$ that is not exact. Meanwhile, Theorem 5 in the same post (the Poincaré lemma) implies that for any star-shaped open subset $U \subseteq \R^n$, we have that $H^1_\text{dR}(U) = 0$.</p>
<p class="right">//</p>
<p>Remarkably, de Rham cohomology is a <em>diffeomorphism invariant</em>. One can show (the proof is omitted here since we have not talked about pullbacks) that diffeomorphic smooth manifolds have isomorphic de Rham cohomology groups. Moreover, de Rham cohomology is in fact a <em>topological invariant</em>.</p>
<p>The computation of de Rham cohomology groups is in general not easy. One can use homotopy invariance or the Mayer–Vietoris theorem to do so, but these require many prerequisites from topology and are thus outside the scope of this post. For completeness, however, we state the main results below. The first concerns contractible manifolds.</p>
<p><strong>Theorem 2 (Cohomology of Contractible Manifolds).</strong> If $M$ is a contractible smooth manifold then $H^p_\text{dR}(M) = 0$ for $p \geq 1$.</p>
<p class="right">$\square$</p>
<p>As a consequence we can generalize the Poincaré Lemma for covector fields (Theorem 5 <a href="/techblog/2020/03/14/covector-field/">here</a>) to higher order differential forms.</p>
<p><strong>Theorem 3 (The Poincaré Lemma).</strong> If $U$ is a star-shaped open subset of $\R^n$, then $H^p_\text{dR}(U) = 0$ for $p \geq 1$.</p>
<p><em>Proof.</em> $U$ being star-shaped implies that it is contractible (contract $U$ to the center point along the line segments), so the claim follows from Theorem 2.</p>
<p class="right">$\square$</p>
<p>A particular consequence is that we can now see why we did not need to bother about the main question of this post in Euclidean spaces.</p>
<p><strong>Theorem 4 (Cohomology of Euclidean Spaces).</strong> For any integers $n \geq 0$ and $p \geq 1$, we have that $H^p_\text{dR}(\R^n) = 0$.</p>
<p><em>Proof.</em> $\R^n$ is star-shaped for any $n \geq 0$, so Theorem 3 applies.</p>
<p class="right">$\square$</p>
<p>Finally, the following result (whose proof requires the Mayer–Vietoris theorem) formalizes what we stated in Example 1.</p>
<p><strong>Theorem 5 (Cohomology of Punctured Euclidean Space).</strong> Let $n \geq 2$ be an integer, $x \in \R^n$, and $M := \R^n \setminus \{ x \}$. Then, the only nontrivial de Rham groups of $M$ are $H^0_\text{dR}(M)$ and $H^{n-1}_\text{dR}(M)$.</p>
<p class="right">$\square$</p>
<p>In Example 1 we have $n = 2$, thus $H^1_\text{dR}(\R^2 \setminus \{ 0 \})$ is nonzero, and therefore closed covector fields on $\R^2 \setminus \{ 0 \}$ are not necessarily exact.</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Lee, John M. <em>Introduction to Smooth Manifolds</em>. Springer, New York, NY, 2013.</li>
</ol>
Fri, 01 May 2020 08:00:00 +0200
http://wiseodd.github.io/techblog/2020/05/01/de-rham-cohomology/
<h2 class="section-heading">Towards de Rham Cohomology, Part II: Differential Forms and the Exterior Derivative</h2>
<p>A tensor, i.e. a multilinear map, can be symmetric or alternating, covariant or contravariant. Here we focus on alternating covariant tensors. “Covariant” means that they take vectors as arguments and output a real number. “Alternating” means that this real number changes sign whenever two of those arguments are interchanged. These tensors are important and have a special name: <strong><em>$k$-covectors</em></strong>.</p>
<p>We will begin by studying an important tool for manipulating $k$-covectors on a vector space $V$: the wedge product. We will then see how we can generalize $k$-covectors to the smooth manifold setting, in the form of differential forms. The key to this article (and to de Rham cohomology) is the exterior derivative of differential forms. As always, this article is based on Lee’s smooth manifolds book [1].</p>
<h2 class="section-heading">Covector fields on smooth manifolds</h2>
<p>Let $V$ be a finite-dimensional vector space and denote the space of $k$-covectors on $V$ by $\Lambda^k(V^*)$. If $T^k(V^*)$ denotes the space of $k$-tensors (both alternating and symmetric), then we can define the projection $\mathrm{Alt}: T^k(V^*) \to \Lambda^k(V^*)$, called <strong><em>alternation</em></strong>, by</p>
<script type="math/tex; mode=display">(\mathrm{Alt} \, \alpha)(v_1, \dots, v_k) := \frac{1}{k!} \sum_{\sigma \in S_k} (\mathrm{sgn} \, \sigma) \, \alpha(v_{\sigma(1)}, \dots, v_{\sigma(k)}) \, ,</script>
<p>where $S_k$ is the symmetric group (consisting of permutations) of $k$ elements, $\alpha \in T^k(V^*)$, and $v_1, \dots, v_k \in V$. When $k = 2$, we get the familiar formula:</p>
<script type="math/tex; mode=display">(\mathrm{Alt} \, \beta)(v, w) = \frac{1}{2} \left( \beta(v, w) - \beta(w, v) \right) \, ,</script>
<p>where $\beta$ is a 2-tensor. Compare this to the symmetric-antisymmetric decomposition of matrices in linear algebra!</p>
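That comparison can be checked numerically (a sketch with arbitrary hypothetical data, not from the post): represent a $2$-tensor on $\R^3$ by its matrix $B$, so $\beta(v, w) = v^\top B w$; alternation then corresponds to taking the antisymmetric part $(B - B^\top)/2$.

```python
import numpy as np

# A 2-tensor beta(v, w) = v^T B w on R^3, represented by an arbitrary matrix B.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))

# Matrix of Alt(beta): the antisymmetric part of B.
A = 0.5 * (B - B.T)

v, w = rng.standard_normal(3), rng.standard_normal(3)
print(np.isclose(v @ A @ w, 0.5 * (v @ B @ w - w @ B @ v)))  # True
print(np.allclose(A, -A.T))              # True: Alt(beta) is alternating
print(np.allclose(0.5 * (A - A.T), A))   # True: Alt is a projection (Alt o Alt = Alt)
```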
<p>Given $\omega \in \Lambda^k(V^*)$ and $\eta \in \Lambda^l(V^*)$, we define their <strong><em>wedge product</em></strong> to be the following $(k+l)$-covector:</p>
<script type="math/tex; mode=display">\omega \wedge \eta := \frac{(k + l)!}{k!l!} \mathrm{Alt}(\omega \otimes \eta) \, .</script>
<p>In words, it is the alternated tensor product of $\omega$ and $\eta$, times a coefficient that depends on the degrees of $\omega$ and $\eta$. The wedge product is bilinear, associative, anticommutative, and for covectors it is characterized by</p>
<script type="math/tex; mode=display">\left( \omega^1 \wedge \dots \wedge \omega^k \right)(v_1, \dots, v_k) = \mathrm{det} \, (\omega^j(v_i)) \, ,</script>
<p>where $\omega^1, \dots, \omega^k$ are arbitrary covectors; $v_1, \dots, v_k$ are vectors; and $(\omega^j(v_i))$ is the matrix whose $(i, j)$-th entry is $\omega^j(v_i)$.</p>
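The determinant characterization can also be verified numerically. The sketch below (assumed setup: covectors on $\R^5$ acting on vectors by the dot product) computes the wedge of $k$ covectors via the signed sum over permutations and compares it with $\det(\omega^j(v_i))$.

```python
import numpy as np
from itertools import permutations

def perm_sign(p):
    """Sign of a permutation of (0, ..., k-1), via its inversion count."""
    inv = sum(1 for i in range(len(p)) for j in range(i + 1, len(p)) if p[i] > p[j])
    return -1 if inv % 2 else 1

def wedge_of_covectors(ws, vs):
    """(w^1 ^ ... ^ w^k)(v_1, ..., v_k) as the signed sum over permutations."""
    k = len(ws)
    return sum(perm_sign(p) * np.prod([ws[j] @ vs[p[j]] for j in range(k)])
               for p in permutations(range(k)))

rng = np.random.default_rng(1)
ws = rng.standard_normal((3, 5))   # three covectors on R^5 (rows)
vs = rng.standard_normal((3, 5))   # three vectors on R^5 (rows)

M = np.array([[w @ v for w in ws] for v in vs])   # M[i][j] = w^j(v_i)
print(np.isclose(wedge_of_covectors(ws, vs), np.linalg.det(M)))  # True
```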
<h2 class="section-heading">Differential forms and exterior derivatives</h2>
<p>We now apply what we know about $k$-covectors to smooth manifolds. Roughly, instead of the dual space of a finite-dimensional vector space $V$, we now work with the cotangent bundle $T^*M$ of a smooth manifold $M$. Let the bundle of alternating covariant $k$-tensors on $M$ be denoted by</p>
<script type="math/tex; mode=display">\Lambda^k T^*M := \coprod_{p \in M} \Lambda^k(T^*_pM) \, .</script>
<p>A <strong><em>differential $k$-form</em></strong> is defined as a continuous tensor field on $M$ whose value at each point is an element of $\Lambda^k(T^*_pM)$. We furthermore denote the vector space of smooth $k$-forms by $\Omega^k(M)$ and define the wedge product pointwise by $(\omega \wedge \eta)_p := \omega_p \wedge \eta_p$. In any smooth chart, a $k$-form $\omega$ can be written locally as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\omega &= \sum_{i_1 < \dots < i_k} \omega_{i_1 \dots i_k} \, dx^{i_1} \wedge \dots \wedge dx^{i_k} \\
&=: \sum_I \omega_I \, dx^I \, .
\end{align} %]]></script>
<p>Some familiar examples of differential forms are continuous real-valued functions ($0$-forms, i.e. scalar fields) and covector fields ($1$-forms).</p>
<p>Since a smooth $0$-form is just a smooth real-valued function, we can ask ourselves whether the concept of differentiation of a function generalizes to smooth $k$-forms. The answer is yes, and the generalization is called the exterior derivative. The motivation for the exterior derivative comes from the <a href="/techblog/2020/03/14/covector-field/">previous post</a> in this series.</p>
<p>Recall that not all $1$-forms are differentials of functions. If $\omega$ is a $1$-form, a necessary condition for the existence of a smooth function $f$ s.t. $\omega = df$ is that $\omega$ be closed, i.e. its partial derivatives commute in every coordinate chart. That is,</p>
<script type="math/tex; mode=display">\frac{\partial \omega_j}{\partial x^i} - \frac{\partial \omega_i}{\partial x^j} = 0</script>
<p>in every coordinate chart $x$. Notice that in general, the l.h.s. above is antisymmetric in the indices $i$ and $j$, so it can be seen as the $ij$-th component of an alternating 2-tensor field (2-form). We can define a 2-form $d\omega$ locally in each smooth chart by</p>
<script type="math/tex; mode=display">% <![CDATA[
d\omega = \sum_{i < j} \left( \frac{\partial \omega_j}{\partial x^i} - \frac{\partial \omega_i}{\partial x^j} \right) dx^i \wedge dx^j \, . %]]></script>
<p>So $\omega$ is closed if and only if $d\omega = 0$ in each chart. It turns out that $d\omega$ is well-defined globally, independent of the choice of coordinate chart, and can be generalized to differential forms of all degrees. For every smooth manifold $M$, there is a differential operator $d: \Omega^k(M) \to \Omega^{k+1}(M)$ satisfying $d(d\omega) = 0$ for every $k$-form $\omega$, which we will define after the construction on Euclidean space.</p>
<p>On Euclidean space, we can define $d$ as follows. Let</p>
<script type="math/tex; mode=display">\omega := \sum_J \omega_J \, dx^J</script>
<p>be a smooth $k$-form on an open subset $U \subseteq \R^n$. We define its <strong><em>exterior derivative</em></strong> $d\omega$ to be the following $(k+1)$-form:</p>
<script type="math/tex; mode=display">d\omega := \sum_J \sum_i \frac{\partial \omega_J}{\partial x^i} dx^i \wedge dx^{j_1} \wedge \dots \wedge dx^{j_k} \, .</script>
<p>This is indeed a generalization of the previous formula for the $2$-form $d\omega$, since if $\omega$ is a $1$-form, we have that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
d(\omega_j \, dx^j) &= \sum_{i,j} \frac{\partial \omega_j}{\partial x^i} dx^i \wedge dx^j \\
&= \sum_{i < j} \frac{\partial \omega_j}{\partial x^i} dx^i \wedge dx^j + \sum_{i > j} \frac{\partial \omega_j}{\partial x^i} dx^i \wedge dx^j \\
&= \sum_{i < j} \left( \frac{\partial \omega_j}{\partial x^i} - \frac{\partial \omega_i}{\partial x^j} \right) dx^i \wedge dx^j \, ,
\end{align} %]]></script>
<p>where the last equality is obtained by interchanging $i$ and $j$ in the second sum and using the alternating property $dx^j \wedge dx^i = -dx^i \wedge dx^j$. Moreover, if $f$ is a smooth real-valued function (i.e. a smooth $0$-form), we have</p>
<script type="math/tex; mode=display">df = \frac{\partial f}{\partial x^i} \, dx^i \,</script>
<p>which is familiar to us, since it is just the differential of $f$.</p>
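The local formula can be checked symbolically. The following sketch (hypothetical component functions, using SymPy) computes the single $dx \wedge dy$ coefficient of $d\omega$ for a $1$-form on $\R^2$ and confirms that differentials of functions are closed.

```python
import sympy as sp

x, y = sp.symbols('x y')

def d_of_1form(w1, w2):
    """Coefficient of dx ^ dy in d(w1 dx + w2 dy) on R^2."""
    return sp.simplify(sp.diff(w2, x) - sp.diff(w1, y))

# d(df) = 0 for a smooth 0-form f (an arbitrary example):
f = sp.exp(x) * sp.sin(y) + x**3 * y
df = (sp.diff(f, x), sp.diff(f, y))   # components of the 1-form df
print(d_of_1form(*df))                # 0: df is closed

# A 1-form that is not closed:
print(d_of_1form(-y, x))              # 2: d(-y dx + x dy) = 2 dx ^ dy
```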
<p><strong>Proposition 1 (Some properties of $d$ on $\R^n$).</strong> Let $d$ be the exterior derivative defined above. Then $d$ is linear over $\R$ and $d \circ d \equiv 0$.</p>
<p><em>Proof.</em> Let $\omega, \nu$ be smooth $k$-forms, $x$ be a coordinate chart, and $a, b \in \R$ be constants. For brevity, write $z^i := dx^i \wedge dx^{j_1} \wedge \dots \wedge dx^{j_k}$ (which also depends on the multi-index $J$). By definition of $d$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
d(a \omega + b \nu) &= d\left( \sum_J (a \omega_J + b \nu_J) \, dx^J \right) \\
&= \sum_J \sum_i \frac{\partial (a \omega_J + b \nu_J)}{\partial x^i} z^i \\
&= \sum_J \sum_i \left( a \frac{\partial \omega_J}{\partial x^i} + b \frac{\partial \nu_J}{\partial x^i} \right) z^i \\
&= a \sum_J \sum_i \frac{\partial \omega_J}{\partial x^i} z^i + b \sum_J \sum_i \frac{\partial \nu_J}{\partial x^i} z^i \\
&= a \, d\omega + b \, d\nu \, ,
\end{align} %]]></script>
<p>which proves the linearity of $d$. For the second statement, we consider first the special case of a $0$-form $f$. We have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
d(df) &= d\left( \frac{\partial f}{\partial x^j} dx^j \right) = \frac{\partial^2 f}{\partial x^i \partial x^j} dx^i \wedge dx^j \\
&= \sum_{i < j} \left( \frac{\partial^2 f}{\partial x^i \partial x^j} - \frac{\partial^2 f}{\partial x^j \partial x^i} \right) dx^i \wedge dx^j = 0 \, ,
\end{align} %]]></script>
<p>since the second partial derivatives of $f$ commute.</p>
<p>For the general case, we require a result that is not proved in this post (see e.g. Prop. 14.23 (b) of Lee’s book), namely: if $\omega$ is a smooth $k$-form and $\nu$ is a smooth $l$-form on an open subset $U$ of $\R^n$, then</p>
<script type="math/tex; mode=display">d(\omega \wedge \nu) = d\omega \wedge \nu + (-1)^k \omega \wedge d\nu \, .</script>
<p>Using this result, the linearity of $d$, and keeping in mind that each $\omega_J$ and each $x^{j_i}$ is just a $0$-form (a real-valued function), we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
d(d\omega) &= d \left( \sum_J d\omega_J \wedge dx^{j_1} \wedge \dots \wedge dx^{j_k} \right) \\
&= \sum_J d\left(d\omega_J \wedge (dx^{j_1} \wedge \dots \wedge dx^{j_k}) \right) \\
&= \sum_J \left( d(d\omega_J) \wedge dx^{j_1} \wedge \dots \wedge dx^{j_k} - d\omega_J \wedge d(dx^{j_1} \wedge \dots \wedge dx^{j_k}) \right) \\
&= \sum_J \Big( d(d\omega_J) \wedge dx^{j_1} \wedge \dots \wedge dx^{j_k} \\
& \qquad + \sum_{i=1}^k (-1)^i \, d\omega_J \wedge dx^{j_1} \wedge \dots \wedge d(dx^{j_i}) \wedge \dots \wedge dx^{j_k} \Big) \, .
\end{align} %]]></script>
<p>Applying the previous result for $0$-forms to $d(d\omega_J)$ and $d(dx^{j_i})$ shows that the summands are all zero. Thus $d(d\omega) = 0$. (The last equality can be seen more clearly if one writes out explicitly what $d(dx^{j_1} \wedge dx^{j_2})$, $d(dx^{j_1} \wedge dx^{j_2} \wedge dx^{j_3})$, and so on, are, taking into account that the wedge product is associative.)</p>
<p class="right">$\square$</p>
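The product rule used in the proof can be sanity-checked symbolically in its simplest case (a sketch with arbitrary hypothetical functions): a $0$-form $f$ and a $1$-form $\nu$ on $\R^2$, where the sign $(-1)^k$ is $+1$.

```python
import sympy as sp

x, y = sp.symbols('x y')

# f is a 0-form, nu = n1 dx + n2 dy a 1-form on R^2; compare the dx ^ dy
# coefficients of d(f nu) and df ^ nu + f d(nu).
f = sp.sin(x) * y
n1, n2 = x * y, sp.exp(y)

d_f_nu = sp.diff(f * n2, x) - sp.diff(f * n1, y)          # d(f nu)
df_wedge_nu = sp.diff(f, x) * n2 - sp.diff(f, y) * n1     # df ^ nu
f_d_nu = f * (sp.diff(n2, x) - sp.diff(n1, y))            # f d(nu)

print(sp.simplify(d_f_nu - (df_wedge_nu + f_d_nu)))       # 0
```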
<p>We can then use the above properties to define the exterior derivative on manifolds. Formally, if $M$ is a smooth manifold, the operators $d: \Omega^k(M) \to \Omega^{k+1}(M)$ for all $k$ are called <strong><em>exterior differentiation</em></strong> if they satisfy the following four properties:</p>
<ol>
<li>$d$ is linear over $\R$.</li>
<li>If $\omega \in \Omega^k(M)$ and $\nu \in \Omega^l(M)$, then $d(\omega \wedge \nu) = d\omega \wedge \nu + (-1)^k \omega \wedge d\nu$. In particular, if $\omega$ is a $0$-form, this reduces to the usual Leibniz (product) rule.</li>
<li>$d \circ d = 0$.</li>
<li>For $f \in \Omega^0(M) = C^\infty(M)$, $df$ is the differential of $f$.</li>
</ol>
<p>Moreover, in any smooth coordinate chart, $d$ is given by the exterior derivative we defined above on Euclidean space.</p>
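On $\R^3$ these axioms recover the classical vector calculus operators: $d$ on $0$-forms corresponds to the gradient, on $1$-forms to the curl, and on $2$-forms to the divergence, so $d \circ d = 0$ encodes the identities $\mathrm{curl}(\mathrm{grad}\, f) = 0$ and $\mathrm{div}(\mathrm{curl}\, V) = 0$. A symbolic sketch (hypothetical component functions):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')

# A 1-form omega = P dx + Q dy + R dz with arbitrary smooth components.
P, Q, R = x * y * z, sp.sin(x * z), y**2 + sp.exp(z)

# d(omega) corresponds to the curl of (P, Q, R):
curl = (sp.diff(R, y) - sp.diff(Q, z),
        sp.diff(P, z) - sp.diff(R, x),
        sp.diff(Q, x) - sp.diff(P, y))

# d(d omega) corresponds to the divergence of the curl, which must vanish:
d_d_omega = sum(sp.diff(c, v) for c, v in zip(curl, (x, y, z)))
print(sp.simplify(d_d_omega))   # 0, i.e. d(d omega) = 0
```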
<p>We can extend the terminology we have from the previous post about covector fields. A smooth $k$-form $\omega$ is <strong><em>closed</em></strong> if $d\omega = 0$, and <strong><em>exact</em></strong> if there exists a smooth $(k-1)$-form $\nu$ on $M$ s.t. $\omega = d\nu$. Since $d \circ d = 0$, every exact form is closed.</p>
<p>The exterior differentiation will be the key for studying de Rham cohomology, which is our next and last post in this series.</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Lee, John M. <em>Introduction to Smooth Manifolds</em>. Springer, New York, NY, 2013.</li>
</ol>
Fri, 17 Apr 2020 15:00:00 +0200
http://wiseodd.github.io/techblog/2020/04/17/exterior-derivative/
<h2 class="section-heading">Towards de Rham Cohomology, Part I: Covector Fields</h2>
<p>In vector calculus, we know that a vector field $V: U \to \R^3$ defined on an open subset $U \subseteq \R^3$ is called conservative if there exists $f \in C^1(U)$ such that</p>
<script type="math/tex; mode=display">V = \nabla f \, .</script>
<p>This kind of vector field is important since (i) it has zero curl (i.e. it is irrotational) and (ii) line integrals w.r.t. this vector field are easy to compute. In terms of dynamical systems, having zero curl can be beneficial since the system can potentially converge to an equilibrium point faster.</p>
<p>For the reasons above, it is important to have an easy way to check whether a given vector field is conservative, other than directly proving whether such a function $f$ exists. In this article, we will see a more general view of these concepts on a smooth manifold and attempt to answer this question in that setting. Important ingredients for this are differential forms and de Rham cohomology groups.</p>
<p>We will begin by re-interpreting Euclidean vector fields as covector fields on smooth manifolds. We will then re-define the term “conservative” in this context. A necessary condition for a covector field to be conservative will then be presented. Finally, we will see that to identify a conservative covector field, one needs to take the global topological properties of the manifold into account. This post can be seen as a summary of Lee’s smooth manifolds book [1] with additional notes from myself.</p>
<h2 class="section-heading">Covector fields on smooth manifolds</h2>
<p>For any smooth manifold $M$, the disjoint union</p>
<script type="math/tex; mode=display">T^*M := \coprod_{p \in M} T^*_p M</script>
<p>is called the <strong><em>cotangent bundle</em></strong> of $M$ (see <a href="/techblog/2019/02/22/riemannian-geometry/">the notes on Riemannian geometry</a>). A <strong><em>covector field</em></strong> (also known as a differential $1$-form) is a function $M \to T^*M$ defined by $p \mapsto \omega(p) \in T^*_pM$. That is, a covector field assigns to each point a covector in the cotangent space at that point. Covector fields are dual to vector fields: whereas a vector field can be thought of as assigning to each point of $M$ an arrow, a covector field can be thought of as assigning to each point of $M$ a function that “measures” arrows at that point.</p>
<p>One important application of covector fields is that they generalize the notion of gradient vector field in the Euclidean space to smooth manifolds. To see this, let $f \in C^\infty(M)$. We define a covector field $df$, called the <strong><em>differential of f</em></strong>, by</p>
<script type="math/tex; mode=display">df_p(v) = vf \qquad \text{for} \enspace v \in T_pM</script>
<p>for each point $p \in M$. That is, at each point $p$, the differential $df_p(v)$ is just the directional derivative of $f$ in the direction $v$. Given coordinates $(x^i) := (x^1, \dots, x^n)$ on an open subset $U \subseteq M$ of an $n$-dimensional manifold $M$, the differential $df$ is represented pointwise by</p>
<script type="math/tex; mode=display">df_p = \frac{\partial f}{\partial x^i}(p) \, dx^i \vert_p \, .</script>
<p>(Note that we use the Einstein summation convention.) That is, each component of $df_p$ is just a partial derivative of $f$ evaluated at $p$. Notice that this matches the coordinate formula for the gradient in Euclidean space. So, in general, a “vector” consisting of all the partial derivatives of a function is not a gradient but a differential. And it is not a vector field either, but a covector field. This distinction is important once we are in the non-Euclidean setting.</p>
<h2 class="section-heading">Line integrals</h2>
<p>Another application of covector fields is in generalizing the notion of a line integral from calculus to smooth manifolds. If $M$ is a smooth manifold, a <strong><em>smooth curve segment</em></strong> in $M$ is a smooth curve $\gamma: [a, b] \to M$, where $[a, b] \subset \R$ is a compact interval regarded as a smooth manifold with boundary. A continuous curve $\gamma: [a, b] \to M$ is a <strong><em>piecewise smooth curve segment</em></strong> if there exists a finite partition $a = a_0 < a_1 < \dots < a_k = b$ of $[a, b]$ such that the restriction $\gamma \vert_{[a_{i-1}, a_i]}$ is smooth for each $i$.</p>
<p>Suppose $\omega$ is a covector field on $M$ and $\gamma:[a, b] \to M$ is a smooth curve segment on $M$. The <strong><em>line integral of $\omega$ over $\gamma$</em></strong> is the real number</p>
<script type="math/tex; mode=display">\int_\gamma \omega = \int^b_a \omega_{\gamma(t)}(\gamma'(t)) \, dt \, .</script>
<p>Moreover, if $\gamma$ is piecewise smooth, then</p>
<script type="math/tex; mode=display">\int_\gamma \omega = \sum_{i=1}^k \int_{a_{i-1}}^{a_i} \omega_{\gamma(t)}(\gamma'(t)) \, dt \, .</script>
<p>In the case where the covector field $\omega$ is a differential $df$ of some real-valued function $f$ on $M$, the computation of the line integral is trivial. This result is known as <em>the fundamental theorem for line integrals</em>.</p>
<p><strong>Theorem 1 (Fundamental Theorem for Line Integrals).</strong> <em>Let $M$ be a smooth manifold. Suppose $f$ is a smooth real-valued function on $M$ and $\gamma: [a, b] \to M$ is a piecewise smooth curve segment in $M$. Then</em></p>
<script type="math/tex; mode=display">\int_\gamma df = f(\gamma(b)) - f(\gamma(a)) \, .</script>
<p><em>Proof.</em> By definition above,</p>
<script type="math/tex; mode=display">\int_\gamma df = \int_a^b df_{\gamma(t)} (\gamma'(t)) \, dt \, .</script>
<p>Furthermore, we can show that</p>
<script type="math/tex; mode=display">\int_a^b df_{\gamma(t)} (\gamma'(t)) \, dt = \int_a^b (f \circ \gamma)'(t) \, dt \, .</script>
<p>Since $(f \circ \gamma)'$ is the derivative of the real-valued function $f \circ \gamma$ on $[a, b]$, the fundamental theorem of calculus gives $\int_\gamma df = f\circ\gamma(b) - f\circ\gamma(a)$ whenever $\gamma$ is smooth.</p>
<p>If $\gamma$ is only piecewise smooth, let $a = a_0 < \dots < a_k = b$ be the endpoints of the subintervals such that for each $i$ the restriction $\gamma \vert_{[a_{i-1}, a_i]}$ is smooth. Applying the smooth case on each subinterval, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\int_\gamma df &= \sum_{i=1}^k \left( f \circ \gamma(a_i) - f \circ \gamma(a_{i-1}) \right) = f \circ \gamma(b) - f \circ \gamma(a) \, ,
\end{align} %]]></script>
<p>since the summation telescopes: everything cancels out except the terms $-f \circ \gamma(a)$ and $f \circ \gamma(b)$.</p>
<p class="right">$\square$</p>
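Theorem 1 can also be sanity-checked numerically. The sketch below (a hypothetical $f$ and curve, approximating the integral by the midpoint rule) compares $\int_\gamma df$ with $f(\gamma(b)) - f(\gamma(a))$:

```python
import numpy as np

# f(x, y) = x^2 y and the smooth curve gamma(t) = (t, t^3) on [0, 1].
f = lambda x, y: x**2 * y
fx = lambda x, y: 2 * x * y         # df/dx
fy = lambda x, y: x**2              # df/dy

n = 100_000
t = (np.arange(n) + 0.5) / n        # midpoints of a uniform partition of [0, 1]
x, y = t, t**3                      # gamma(t)
dx, dy = np.ones_like(t), 3 * t**2  # gamma'(t)

integral = np.sum(fx(x, y) * dx + fy(x, y) * dy) / n   # approximates the line integral
print(np.isclose(integral, f(1, 1) - f(0, 0)))         # True: both equal 1
```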
<h2 class="section-heading">Conservative covector fields</h2>
<p>We say a smooth covector field $\omega$ on $M$ is <strong><em>exact</em></strong> if there is a function $f \in C^\infty(M)$ such that $\omega = df$. Because of the theorem above, it is therefore important to be able to identify covector fields that are exact. In other words, we want to know both necessary and sufficient conditions for a covector field to be exact. The theorem above provides a hint: If $\gamma$ is a <strong><em>closed curve segment</em></strong>, i.e. $\gamma(a) = \gamma(b)$, then $\int_\gamma df = 0$ for any $f \in C^\infty(M)$. Formally, we say a covector field $\omega$ is <strong><em>conservative</em></strong> if for every piecewise smooth closed curve $\gamma$, the line integral $\int_\gamma \omega$ is zero. The following result shows that conservativeness is equivalent to exactness.</p>
<p><strong>Theorem 2.</strong> <em>Let $M$ be a smooth manifold. A smooth covector field on $M$ is conservative if and only if it is exact.</em></p>
<p><em>Proof.</em> See [1], Theorem 11.42.</p>
<p class="right">$\square$</p>
<p>There is a necessary condition for a covector field to be exact (and hence conservative). First we need another definition. A smooth covector field $\omega$ is <strong><em>closed</em></strong> if its components in <em>every</em> smooth chart satisfy</p>
<script type="math/tex; mode=display">\frac{\partial \omega_j}{\partial x^i} = \frac{\partial \omega_i}{\partial x^j} \, .</script>
<p>(Note that, in the Euclidean case, this implies that the Jacobian of $\omega$ is symmetric.)</p>
<p><strong>Proposition 3.</strong> <em>Every exact covector field is closed.</em></p>
<p><em>Proof.</em> Let $\omega$ be an arbitrary exact covector field. Let $f \in C^\infty(M)$ s.t. $\omega = df$ and let $(U, (x^i))$ be any smooth chart on $M$. By <a href="https://en.wikipedia.org/wiki/Symmetry_of_second_derivatives#Schwarz's_theorem">Schwarz’s theorem</a>, $f$ satisfies</p>
<script type="math/tex; mode=display">\frac{\partial^2 f}{\partial x^i \partial x^j} = \frac{\partial^2 f}{\partial x^j \partial x^i}</script>
<p>on $U$. Furthermore, since $\omega = df$, its component $\omega_i$ is equal to $\partial f/ \partial x^i$ for all $i$, by the definition of $df$. Plugging this back into the previous equation, we have</p>
<script type="math/tex; mode=display">\frac{\partial}{\partial x^j} \left( \frac{\partial f}{\partial x^i} \right) = \frac{\partial}{\partial x^i} \left( \frac{\partial f}{\partial x^j} \right) \iff \frac{\partial \omega_i}{\partial x^j} = \frac{\partial \omega_j}{\partial x^i} \, .</script>
<p>Thus, $\omega$ is closed as required.</p>
<p class="right">$\square$</p>
<p>Proposition 3 together with Theorem 2 therefore gives a necessary condition for a covector field to be conservative. The only thing left is the converse direction: under what conditions is a closed covector field exact (and hence conservative)? Here is a motivating example showing why this question might not be trivial to answer.</p>
<p><strong>Example 4.</strong> Let $M = \R^2 \setminus \{ 0 \}$ and let $\omega$ be the covector field on $M$ given by</p>
<script type="math/tex; mode=display">\omega := \frac{x \, dy - y \, dx}{x^2 + y^2} \, .</script>
<p>Let $\gamma: [0, 2\pi] \to M$ be the curve segment defined by $\gamma(t) := (\cos t, \sin t)$. Then $\gamma'(t) = (-\sin t, \cos t)$ and the line integral can be written as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\int_0^{2\pi} \omega_{\gamma(t)} (\gamma'(t)) \, dt &= \int_0^{2\pi} \frac{\cos t \, (\cos t) - \sin t \, (-\sin t)}{\cos^2 t + \sin^2 t} \, dt \\
&= \int_0^{2\pi} (\cos^2 t + \sin^2 t) \, dt \\
&= \int_0^{2\pi} dt = 2\pi \, .
\end{align} %]]></script>
<p>We can see that $\gamma$ is the counterclockwise unit circle and thus is a closed curve segment. However, as we have just seen, the line integral is non-zero. Thus, $\omega$ is not conservative on $\R^2 \setminus \{ 0 \}$. But $\omega = \omega_1 \, dx + \omega_2 \, dy$ is a closed covector field, since</p>
<script type="math/tex; mode=display">\frac{\partial \omega_1}{\partial y} = \frac{y^2 - x^2}{(x^2 + y^2)^2} = \frac{\partial \omega_2}{\partial x} \, .</script>
<p>Therefore, this shows that in $\R^2 \setminus \{ 0 \}$, closedness does not necessarily imply exactness.</p>
<p>What if we restrict the domain of $\omega$ to the right half-plane $U := \{ (x, y): x > 0 \}$ of $\R^2$? There, if we define $f: U \to \R$ by $f(x, y) := \tan^{-1}(y/x)$, which is a smooth function on $U$, we can verify that $\omega = df$. Thus, in this case $\omega$ is exact, and therefore conservative by Theorem 2.</p>
<p>As a further note, this problem can be seen more clearly if we think of $f$ as the angle function $\theta = \tan^{-1}(y/x)$ of polar coordinates. The angle cannot be defined continuously on all of $\R^2 \setminus \{ 0 \}$: going once around the origin increases $\theta$ by $2\pi$, so any choice of $\theta$ must have a jump discontinuity along some ray from the origin.</p>
<p class="right">//</p>
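Example 4 can be confirmed numerically. The sketch below (midpoint rule, assumed numerical setup) integrates $\omega$ over the unit circle and recovers $2\pi$, witnessing that $\omega$ is not conservative on $\R^2 \setminus \{ 0 \}$:

```python
import numpy as np

# omega = (x dy - y dx) / (x^2 + y^2) pulled back along gamma(t) = (cos t, sin t).
n = 200_000
t = (np.arange(n) + 0.5) * (2 * np.pi / n)      # midpoints of a partition of [0, 2*pi]
x, y = np.cos(t), np.sin(t)                     # gamma(t)
dx, dy = -np.sin(t), np.cos(t)                  # gamma'(t)

integrand = (x * dy - y * dx) / (x**2 + y**2)   # omega_{gamma(t)}(gamma'(t)), identically 1
integral = np.sum(integrand) * (2 * np.pi / n)

print(np.isclose(integral, 2 * np.pi))          # True: nonzero over a closed curve
```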
<p>This example is the principal motivation for de Rham cohomology, which we will look into in a future article: we need to take into account the global topological properties of the domain of a covector field to answer whether it is exact or not. We can formalize the observation above in the following theorem.</p>
<p><strong>Theorem 5 (Poincaré Lemma for Covector Fields).</strong> <em>Let $U$ be a star-shaped open subset of $\R^n$ or $\mathbb{H}^n$. That is, $U$ has the following property: There exists $c \in U$ s.t. for every $x \in U$, the line segment from $c$ to $x$ is entirely contained in $U$. Then every closed covector field on $U$ is exact.</em></p>
<p><em>Proof.</em> Let $\omega = \omega_i dx^i$ be a closed covector field on $U$. Without loss of generality, we can assume that $c = 0$. For any $x \in U$, let $\gamma_x: [0, 1] \to U$ be the line segment from $0$ to $x$, defined by $\gamma_x(t) := tx$. Since $U$ is star-shaped, the image of $\gamma_x$ lies entirely in $U$ for each $x \in U$. Define $f: U \to \R$ by</p>
<script type="math/tex; mode=display">f(x) := \int_{\gamma_x} \omega \, .</script>
<p>We need to show that $\omega = df$, i.e. $\partial f / \partial x^j = \omega_j$ for all $j$. We note that</p>
<script type="math/tex; mode=display">f(x) = \int_0^1 \omega_{\gamma_x(t)}(\gamma'_x(t)) \, dt = \int_0^1 \left( \sum_{i=1}^n \omega_i(tx) x^i \right) \, dt \, ,</script>
<p>where we have taken a similar step to the computation in Example 4. Now we need to compute the partial derivatives of $f$. Notice that all terms in the integrand are smooth. Hence, by <a href="https://en.wikipedia.org/wiki/Leibniz_integral_rule">Leibniz’s theorem</a>, we can exchange the differentiation and integral to obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial f}{\partial x^j}(x) &= \int_0^1 \left( \sum_{i=1}^n \frac{\partial}{\partial x^j} \omega_i(tx) x^i \right) \, dt \\
&= \int_0^1 \left( \sum_{i=1}^n \frac{\partial \omega_i}{\partial x^j}(tx) t x^i + \omega_i(tx) \delta^i_j \right) \, dt \\
&= \int_0^1 \left( \sum_{i=1}^n \frac{\partial \omega_i}{\partial x^j}(tx) t x^i + \omega_j(tx) \right) \, dt \, .
\end{align} %]]></script>
<p>Now, since $\omega$ is closed, we have $\partial \omega_i / \partial x^j = \partial \omega_j / \partial x^i$, and thus</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial f}{\partial x^j}(x) &= \int_0^1 \left( \sum_{i=1}^n \frac{\partial \omega_j}{\partial x^i}(tx) t x^i + \omega_j(tx) \right) \, dt \\
&= \int_0^1 \frac{d}{dt}(t \omega_j(tx)) \, dt \\
&= \left[ t \omega_j(tx) \right]_{t=0}^{t=1} \\
&= \omega_j(x) \, ,
\end{align} %]]></script>
<p>as required.</p>
<p class="right">$\square$</p>
<p>Here is a corollary of this theorem which states that every closed covector field is <em>locally</em> exact, regardless of the global topology of the space.</p>
<p><strong>Corollary 6.</strong> <em>Let $\omega$ be a closed covector field on a smooth manifold $M$. Then every point of $M$ has a neighborhood on which $\omega$ is exact.</em></p>
<p><em>Proof.</em> We need to show that for every $p \in M$, there exists a neighborhood $U$ containing $p$ s.t. $\omega$ is exact on $U$. Let $p \in M$ be arbitrary and pick a smooth coordinate ball $U$ containing $p$. Since balls are convex and therefore star-shaped, and $\omega$ is closed on $U$, Theorem 5 implies that $\omega$ is exact on $U$. Since $p$ is arbitrary, this property holds for every $p \in M$. Thus $\omega$ is locally exact.</p>
<p class="right">$\square$</p>
<p>This corollary is useful in e.g. local analysis of a dynamical system around an equilibrium point: To show that the corresponding covector field is conservative, one only needs to show that its mixed partial derivatives commute there.</p>
<p>In the second part of this series of articles, we will generalize the notion of differential. We will talk about the <em>exterior derivative</em> of higher order differential forms. The exterior derivative is an essential ingredient for defining de Rham cohomology, which will be studied in the third and final part of this series.</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Lee, John M. “Smooth manifolds.” Introduction to Smooth Manifolds. Springer, New York, NY, 2013. 1-31.</li>
</ol>
Sat, 14 Mar 2020 08:00:00 +0100
http://wiseodd.github.io/techblog/2020/03/14/covector-field/
Optimization and Gradient Descent on Riemannian Manifolds<p>Riemannian geometry can be seen as a generalization of calculus to Riemannian manifolds. Objects from calculus such as the gradient, Jacobian, and Hessian on $\R^n$ can be adapted to arbitrary Riemannian manifolds. This lets us also generalize one of the most ubiquitous problems in calculus: the optimization problem. The implication of this generalization is far-reaching: We can make a more general, and thus more flexible, assumption regarding the domain of our optimization, which might fit real-world problems better or have some desirable properties.</p>
<p>In this article, we will focus on the most popular optimization method there is, especially in machine learning: the gradient descent method. We will begin with a review of the optimization problem of a real-valued function on $\R^n$, which should already be familiar. Next, we will adapt the gradient descent method to work for a real-valued function on an arbitrary Riemannian manifold $(\M, g)$. Lastly, we will discuss how the <a href="/techblog/2018/03/14/natural-gradient/">natural gradient descent</a> method can be seen from this perspective, instead of purely from the second-order optimization point-of-view.</p>
<h2 class="section-heading">Optimization problem and the gradient descent</h2>
<p>Let $\R^n$ be the usual Euclidean space (i.e. a Riemannian manifold $(\R^n, \bar{g})$ where $\bar{g}_{ij} = \delta_{ij}$) and let $f: \R^n \to \R$ be a real-valued function. An (unconstrained) optimization problem on this space has the form</p>
<script type="math/tex; mode=display">\min_{x \in \R^n} f(x) \, .</script>
<p>That is, we would like to find a point $\hat{x} \in \R^n$ such that $f(\hat{x})$ is the minimum of $f$.</p>
<p>One of the most popular numerical methods for solving this problem is the gradient descent method. Its algorithm is as follows.</p>
<p><strong>Algorithm 1 (Euclidean gradient descent).</strong></p>
<ol>
<li>Pick arbitrary $x_{(0)} \in \R^n$ and let $\alpha \in \R$ with $\alpha > 0$</li>
<li>While the stopping criterion is not satisfied:
<ol>
<li>Compute the gradient of $f$ at $x_{(t)}$, i.e. $h_{(t)} := \gradat{f}{x_{(t)}}$</li>
<li>Move in the direction of $-h_{(t)}$, i.e. $x_{(t+1)} = x_{(t)} - \alpha h_{(t)}$</li>
<li>$t = t+1$</li>
</ol>
</li>
<li>Return $x_{(t)}$</li>
</ol>
<p class="right">//</p>
<p>The justification for the gradient descent method is the fact that the gradient is the direction in which $f$ increases fastest. Its negative therefore points in the direction of steepest descent.</p>
<p><strong>Proposition 1.</strong> <em>Let $f: \R^n \to \R$ be a real-valued function on $\R^n$ and $x \in \R^n$. Among all unit vectors $v \in \R^n$, the gradient $\grad f \, \vert_x$ of $f$ at $x$ is the direction in which the directional derivative $D_v \, f \, \vert_x$ is greatest. Furthermore, $\norm{\gradat{f}{x}}$ equals the value of the directional derivative in that direction.</em></p>
<p><em>Proof.</em> First, note that, by our assumption, $\norm{v} = 1$. By definition of the directional derivative and dot product on $\R^n$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
D_v \, f \, \vert_x &= \grad f \, \vert_x \cdot v \\
&= \norm{\gradat{f}{x}} \norm{v} \cos \theta \\
&= \norm{\gradat{f}{x}} \cos \theta \, ,
\end{align} %]]></script>
<p>where $\theta$ is the angle between $\gradat{f}{x}$ and $v$. As $\norm{\cdot} \geq 0$ and $-1 \leq \cos \theta \leq 1$, the above expression is maximized whenever $\cos \theta = 1$. This implies that the particular vector $\hat{v}$ that maximizes the directional derivative points in the same direction as $\gradat{f}{x}$. Furthermore, plugging in $\hat{v}$ into the above equation, we get</p>
<script type="math/tex; mode=display">D_{\hat{v}} \, f \, \vert_x = \norm{\gradat{f}{x}} \, .</script>
<p>Thus, the length of $\gradat{f}{x}$ is equal to the value of $D_{\hat{v}} \, f \, \vert_x$.</p>
<p class="right">$\square$</p>
<h2 class="section-heading">Gradient descent on Riemannian manifolds</h2>
<p><strong>Remark.</strong> <em>These <a href="/techblog/2019/02/22/riemannian-geometry/">notes about Riemannian geometry</a> are useful as references. We shall use the Einstein summation convention: Repeated indices above and below are implied to be summed, e.g. $v^i w_i \implies \sum_i v^i w_i$ and $g_{ij} v^i v^j \implies \sum_{ij} g_{ij} v^i v^j$. By convention the index in $\partder{}{x^i}$ is thought to be a lower index.</em></p>
<p>We now want to break the confines of the Euclidean space. We would like to generalize the gradient descent algorithm to a function defined on a Riemannian manifold. Based on Algorithm 1, there are at least two parts of the algorithm that we need to adapt, namely, (i) the gradient of $f$ and (ii) the way we move between points on $\M$.</p>
<p>Suppose $(\M, g)$ is an $n$-dimensional Riemannian manifold. Let $f: \M \to \R$ be a real-valued function (scalar field) defined on $\M$. Then, the optimization problem on $\M$ simply has the form</p>
<script type="math/tex; mode=display">\min_{p \in \M} f(p) \, .</script>
<p>Although it seems innocent enough (we only replace $\R^n$ with $\M$ in the Euclidean version), some difficulties exist.</p>
<p>First, we shall discuss the gradient of $f$ on $\M$. By definition, $\grad{f}$ is a vector field on $\M$, i.e. $\grad{f} \in \mathfrak{X}(\M)$, and at each $p \in \M$, $\gradat{f}{p}$ is a tangent vector in $T_p \M$. Let the differential $df$ of $f$ be the one-form which, in given coordinates $\vx_p := (x^1(p), \dots, x^n(p))$, has the form</p>
<script type="math/tex; mode=display">df = \partder{f}{x^i} dx^i \, .</script>
<p>Then, the gradient of $f$ is obtained by raising an index of $df$. That is,</p>
<script type="math/tex; mode=display">\grad{f} = (df)^\sharp \, ,</script>
<p>and in coordinates, it has the expression</p>
<script type="math/tex; mode=display">\grad{f} = g^{ij} \partder{f}{x^i} \partder{}{x^j} \, .</script>
<p>At any $p \in \M$, the gradient is characterized by the following equation: for any $v \in T_p \M$,</p>
<script type="math/tex; mode=display">\inner{\gradat{f}{p}, v}_g = df(v) = vf \, .</script>
<p>That is, pointwise, the inner product of the gradient with any tangent vector $v$ equals the action of the derivation $v$ on $f$. We can think of this action as taking the directional derivative of $f$ in the direction $v$. Thus, we have the analogue of Proposition 1 on Riemannian manifolds.</p>
<p><strong>Proposition 2.</strong> <em>Let $(\M, g)$ be a Riemannian manifold and $f: \M \to \R$ be a real-valued function on $\M$ and $p \in \M$. Among all unit vectors $v \in T_p \M$, the gradient $\gradat{f}{p}$ of $f$ at $p$ is the direction in which the directional derivative $vf$ is greatest. Furthermore, $\norm{\gradat{f}{p}}$ equals the value of the directional derivative in that direction.</em></p>
<p><em>Proof.</em> We simply note that by definition of inner product induced by $g$, we have</p>
<script type="math/tex; mode=display">\inner{u, w}_g = \norm{u}_g \norm{w}_g \cos \theta \qquad \forall \, u, w \in T_p \M \, ,</script>
<p>where $\theta$ is again the angle between $u$ and $w$. Using the characterization of $\gradat{f}{p}$ discussed above and substituting $vf$ for $D_v \, f \, \vert_p$ in the proof of Proposition 1, we immediately get the desired result.</p>
<p class="right">$\square$</p>
<p>Proposition 2 therefore justifies simply substituting the Euclidean gradient with the Riemannian gradient in Algorithm 1.</p>
<p>To make this concrete, we do the computation in coordinates. In coordinates, we can represent $df$ by a row vector $d$ (i.e. a sequence of numbers in the sense of linear algebra) containing all partial derivatives of $f$:</p>
<script type="math/tex; mode=display">d := \left( \partder{f}{x^1}, \dots, \partder{f}{x^n} \right) \, .</script>
<p>Given the matrix representation $G$ of the metric tensor $g$ in coordinates, the gradient of $f$ is represented by a column vector $h$, such that</p>
<script type="math/tex; mode=display">h = G^{-1} d^\T \, .</script>
<p><strong>Example 1. (Euclidean gradient in coordinates).</strong> Notice that in the Euclidean case, $\bar{g}_{ij} = \delta_{ij}$, thus it is represented by an identity matrix $I$, in coordinates. Therefore the Euclidean gradient is simply</p>
<script type="math/tex; mode=display">h = I^{-1} d^\T = d^\T \, .</script>
<p class="right">//</p>
<p>The second modification to Algorithm 1 concerns the way we move between points on $\M$. Notice that, at each $x \in \R^n$, Euclidean gradient descent moves between points by following a straight line in the direction $-\gradat{f}{x}$. We know by the triangle inequality that a straight line is the shortest path between two points in $\R^n$.</p>
<p>On Riemannian manifolds, we move between points by means of curves. There exists a special kind of curve $\gamma: I \to \M$, where $I$ is an interval, called a <strong><em>geodesic</em></strong>, which is “straight” in the sense that the covariant derivative $D_t \gamma'$ of the velocity vector along the curve itself is $0$ at every time $t$. The intuition is as follows: Although $\gamma$, seen from the outsider’s point-of-view, is not straight (it follows the curvature of $\M$), as far as the inhabitants of $\M$ are concerned, $\gamma$ is straight, as its velocity vector (its direction and length) is the same everywhere along $\gamma$. Thus, geodesics are the generalization of straight lines to Riemannian manifolds.</p>
<p>For any $p \in \M$ and $v \in T_p \M$, we can show that there always exists a geodesic starting at $p$ with initial velocity $v$, denoted by $\gamma_v$. Furthermore, if $c, t \in \R$ we can rescale any geodesic $\gamma_v$ by</p>
<script type="math/tex; mode=display">\gamma_{cv}(t) = \gamma_v (ct) \, ,</script>
<p>and thus we can define a map $\exp_p: T_p \M \to \M$ by</p>
<script type="math/tex; mode=display">\exp_p(v) = \gamma_v(1) \, ,</script>
<p>called the exponential map. The exponential map is the generalization of “moving straight in the direction $v$” on Riemannian manifolds.</p>
<p><strong>Example 2. (Exponential map on a sphere).</strong> Let $\mathbb{S}^n(r)$ be a sphere embedded in $\R^{n+1}$ with radius $r$. The shortest path between any pair of points on the sphere can be found by following the <a href="https://en.wikipedia.org/wiki/Great_circle">great circle</a> connecting them.</p>
<p>Let $p \in \mathbb{S}^n(r)$ and $0 \neq v \in T_p \mathbb{S}^n(r)$ be arbitrary. The curve $\gamma_v: \R \to \R^{n+1}$ given by</p>
<script type="math/tex; mode=display">\gamma_v(t) = \cos \left( \frac{t\norm{v}}{r} \right) p + \sin \left( \frac{t\norm{v}}{r} \right) r \frac{v}{\norm{v}} \, ,</script>
<p>is a geodesic, as its image is the great circle formed by the intersection of $\mathbb{S}^n(r)$ with the linear subspace of $\R^{n+1}$ spanned by $\left\{ p, r \frac{v}{\norm{v}} \right\}$. Therefore the exponential map on $\mathbb{S}^n(r)$ is given by</p>
<script type="math/tex; mode=display">\exp_p(v) = \cos \left( \frac{\norm{v}}{r} \right) p + \sin \left( \frac{\norm{v}}{r} \right) r \frac{v}{\norm{v}} \, .</script>
<p class="right">//</p>
<p>Given the exponential map, our modification to Algorithm 1 is complete, which we show in Algorithm 2. The new modifications from Algorithm 1 are in <span style="color:blue">blue</span>.</p>
<p><strong>Algorithm 2 (Riemannian gradient descent).</strong></p>
<ol>
<li>Pick arbitrary <span style="color:blue">$p_{(0)} \in \M$</span>. Let $\alpha \in \R$ with $\alpha > 0$</li>
<li>While the stopping criterion is not satisfied:
<ol>
<li>Compute the gradient of $f$ at $p_{(t)}$, i.e. <span style="color:blue">$h_{(t)} := \gradat{f}{p_{(t)}} = (df \, \vert_{p_{(t)}})^\sharp$</span></li>
<li>Move in the direction $-h_{(t)}$, i.e. <span style="color:blue">$p_{(t+1)} = \exp_{p_{(t)}}(-\alpha h_{(t)})$</span></li>
<li>$t = t+1$</li>
</ol>
</li>
<li>Return $p_{(t)}$</li>
</ol>
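<p>Putting the pieces together, here is a sketch of Algorithm 2 on the unit sphere. The objective $f(p) = -\langle p, a \rangle$ is my choice; its Riemannian gradient uses the standard fact (not proved here) that for an embedded submanifold with the induced metric, the Riemannian gradient is the tangential projection of the ambient Euclidean gradient.</p>

```python
import numpy as np

def sphere_exp(p, v):
    # Exponential map on the unit sphere (Example 2 with r = 1).
    nv = np.linalg.norm(v)
    return p if nv == 0.0 else np.cos(nv) * p + np.sin(nv) * v / nv

def riemannian_gd(grad_f, p0, alpha=0.2, n_iter=200):
    # Algorithm 2: Riemannian gradient step, then exponential map,
    # so every iterate stays exactly on the manifold.
    p = p0
    for _ in range(n_iter):
        h = grad_f(p)
        p = sphere_exp(p, -alpha * h)
    return p

# Minimize f(p) = -<p, a> over the unit sphere; the minimizer is p = a.
a = np.array([0.0, 0.0, 1.0])
grad_f = lambda p: -(a - (a @ p) * p)  # -a projected onto T_p S^2

p_hat = riemannian_gd(grad_f, np.array([1.0, 0.0, 0.0]))
print(p_hat)  # approaches a = [0, 0, 1]
```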
<h2 class="section-heading">Approximating the exponential map</h2>
<p>In general, the exponential map is difficult to compute: to compute a geodesic, we have to solve a system of second-order ODEs. Therefore, for computational reasons, we would like to approximate the exponential map with a cheaper alternative.</p>
<p>Let $p \in \M$ be arbitrary. We define a map $R_p: T_p \M \to \M$, called the <strong><em>retraction</em></strong> map, by the following properties:</p>
<ol>
<li>$R_p(0) = p$</li>
<li>$dR_p(0) = \text{Id}_{T_p \M}$.</li>
</ol>
<p>The second property is called the <strong><em>local rigidity</em></strong> condition and it preserves gradients at $p$. In particular, the exponential map is a retraction. Furthermore, if $d_g$ denotes the Riemannian distance and $t \in \R$, retraction can be seen as a first-order approximation of the exponential map, in the sense that</p>
<script type="math/tex; mode=display">d_g(\exp_p(tv), R_p(tv)) = O(t^2) \, .</script>
<p>On an arbitrary embedded submanifold $\S \subseteq \R^{n+1}$, if $p \in \S$ and $v \in T_p \S$, viewing $p$ as a point in the ambient space and $v$ as a vector in the ambient tangent space $T_p \R^{n+1}$, we can compute $R_p(v)$ by (i) moving along $v$ to get $p + v$ and then (ii) projecting the point $p + v$ back to $\S$.</p>
<p><strong>Example 3. (Retraction on a sphere).</strong> Let $\mathbb{S}^n(r)$ be a sphere embedded in $\R^{n+1}$ with radius $r$. The retraction on any $p \in \mathbb{S}^n(r)$ and $v \in T_p \mathbb{S}^n(r)$ is defined by</p>
<script type="math/tex; mode=display">R_p(v) = r \frac{p + v}{\norm{p + v}}</script>
<p class="right">//</p>
<p>Therefore, the Riemannian gradient descent in Algorithm 2 can be modified to be</p>
<p><strong>Algorithm 3 (Riemannian gradient descent with retraction).</strong></p>
<ol>
<li>Pick arbitrary $p_{(0)} \in \M$. Let $\alpha \in \R$ with $\alpha > 0$</li>
<li>While the stopping criterion is not satisfied:
<ol>
<li>Compute the gradient of $f$ at $p_{(t)}$, i.e. $h_{(t)} := \gradat{f}{p_{(t)}} = (df \, \vert_{p_{(t)}})^\sharp$</li>
<li>Move in the direction $-h_{(t)}$, i.e. <span style="color:blue">$p_{(t+1)} = R_{p_{(t)}}(-\alpha h_{(t)})$</span></li>
<li>$t = t+1$</li>
</ol>
</li>
<li>Return $p_{(t)}$</li>
</ol>
<h2 class="section-heading">Natural gradient descent</h2>
<p>One of the most important applications of Riemannian gradient descent in machine learning is optimization on statistical manifolds. We define a statistical manifold $(\R^n, g)$ to be the set $\R^n$ corresponding to the set of parameters of a statistical model $p_\theta(z)$, equipped with the metric tensor $g$, the Fisher information metric, given by</p>
<script type="math/tex; mode=display">g_{ij} = \E_{z \sim p_\theta} \left[ \partder{\log p_\theta(z)}{\theta^i} \partder{\log p_\theta(z)}{\theta^j} \right] \, .</script>
<p>The most common objective function $f$ in the optimization problem on a statistical manifold is the log-likelihood of our statistical model. That is, given a dataset $\D = \{ z_i \}$, the objective is to maximize $f(\theta) = \sum_{z \in \D} \log p_\theta(z)$ (equivalently, to minimize $-f$).</p>
<p>The metric tensor $g$ is represented by the $n \times n$ matrix $F$, called the <a href="/techblog/2018/03/11/fisher-information/"><em>Fisher information matrix</em></a>. The Riemannian gradient on this manifold can therefore be represented by the column vector $h = F^{-1} d^\T$. Furthermore, as the manifold is $\R^n$, the construction of the retraction map discussed previously tells us that we can simply use the addition $p + v$ for any $p \in \R^n$ and $v \in T_p \R^n$. This is well-defined as there is a natural isomorphism between $\R^n$ and $T_p \R^n$. All in all, gradient descent on this manifold is called <a href="/techblog/2018/03/14/natural-gradient/"><em>natural gradient descent</em></a> and is presented in Algorithm 4 below.</p>
<p><strong>Algorithm 4 (Natural gradient descent).</strong></p>
<ol>
<li>Pick arbitrary $\theta_{(0)} \in \R^n$. Let $\alpha \in \R$ with $\alpha > 0$</li>
<li>While the stopping criterion is not satisfied:
<ol>
<li>Compute the gradient of $f$ at $\theta_{(t)}$, i.e. $h_{(t)} := F^{-1} d^\T$</li>
<li>Move in the direction $-h_{(t)}$, i.e. $\theta_{(t+1)} = \theta_{(t)} - \alpha h_{(t)}$</li>
<li>$t = t+1$</li>
</ol>
</li>
<li>Return $\theta_{(t)}$</li>
</ol>
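<p>As a concrete illustration of Algorithm 4, consider a Bernoulli model $p_\theta(z) = \theta^z (1 - \theta)^{1 - z}$, where everything is available in closed form: for $N$ observations, the Fisher information is $F = N / (\theta (1 - \theta))$ and the gradient of the negative log-likelihood is $d = -N (m - \theta) / (\theta (1 - \theta))$, with $m$ the sample mean. The model choice is mine, not from the post.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.binomial(1, 0.3, size=1000)  # data from a Bernoulli(0.3) model
N, m = len(z), z.mean()

theta, alpha = 0.9, 0.5
for _ in range(50):
    d = -N * (m - theta) / (theta * (1 - theta))  # gradient of -loglik
    F = N / (theta * (1 - theta))                 # Fisher information
    theta = theta - alpha * (d / F)               # natural gradient step
print(theta)  # converges to the MLE, i.e. the sample mean m
```

<p>Note that the natural gradient step $F^{-1} d = -(m - \theta)$ is independent of the parametrization-sensitive factor $\theta (1 - \theta)$, which is precisely the appeal of natural gradient descent.</p>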
<h2 class="section-heading">Conclusion</h2>
<p>Optimization on Riemannian manifolds is an interesting and important application of geometry. It generalizes optimization methods from Euclidean spaces to Riemannian manifolds. Specifically, adapting the gradient descent method to a Riemannian manifold requires us to use the Riemannian gradient as the search direction and the exponential map or a retraction to move between points on the manifold.</p>
<p>One major difficulty exists: Computing and storing the matrix representation $G$ of the metric tensor is very expensive. Suppose the manifold is $n$-dimensional. Then the size of $G$ is in $O(n^2)$ and the complexity of inverting it is in $O(n^3)$. In machine learning, $n$ can be on the order of millions, so a naive implementation is infeasible. Thankfully, many approximations of the metric tensor exist, especially for the Fisher information metric (e.g. [7]). Thus, even with these difficulties, Riemannian gradient descent and its variants have been successfully applied in many areas, such as inference problems [8] and word or knowledge-graph embeddings [9].</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Lee, John M. “Smooth manifolds.” Introduction to Smooth Manifolds. Springer, New York, NY, 2013. 1-31.</li>
<li>Lee, John M. Riemannian manifolds: an introduction to curvature. Vol. 176. Springer Science & Business Media, 2006.</li>
<li>Fels, Mark Eric. “An Introduction to Differential Geometry through Computation.” (2016).</li>
<li>Absil, P-A., Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.</li>
<li>Boumal, Nicolas. Optimization and estimation on manifolds. Diss. Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2014.</li>
<li>Graphics: <a href="https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane">https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane</a>.</li>
<li>Martens, James, and Roger Grosse. “Optimizing neural networks with kronecker-factored approximate curvature.” International conference on machine learning. 2015.</li>
<li>Patterson, Sam, and Yee Whye Teh. “Stochastic gradient Riemannian Langevin dynamics on the probability simplex.” Advances in neural information processing systems. 2013.</li>
<li>Suzuki, Atsushi, Yosuke Enokida, and Kenji Yamanishi. “Riemannian TransE: Multi-relational Graph Embedding in Non-Euclidean Space.” (2018).</li>
</ol>
Fri, 22 Feb 2019 12:00:00 +0100
http://wiseodd.github.io/techblog/2019/02/22/optimization-riemannian-manifolds/
Notes on Riemannian Geometry<p>Recently I have been studying differential geometry, including Riemannian geometry. When studying this subject, a lot of <em>aha</em> moments came up due to my previous (albeit informal) exposure to the geometric point-of-view of the natural gradient method. I found the arguments from this point-of-view to be very elegant, which motivates me to study geometry in more depth. This writing is a collection of small notes (largely from Lee’s Introduction to Smooth Manifolds and Introduction to Riemannian Manifolds) that I find useful as a reference on this subject. Note that this is by no means a complete article; I will update it as I study further.</p>
<h2 class="section-heading">Manifolds</h2>
<p>We are interested in generalizing the notion of Euclidean space to an arbitrary smooth curved space, called a smooth manifold. Intuitively speaking, a <strong><em>topological $n$-manifold</em></strong> $\M$ is a topological space that locally resembles $\R^n$. A <strong><em>smooth $n$-manifold</em></strong> is a topological $n$-manifold equipped with a smoothly compatible <strong><em>local coordinate chart</em></strong> $\phi: U \to \R^n$ around each point $p \in \M$, where $U \subseteq \M$ is open.</p>
<p><strong>Example 1 (Euclidean spaces).</strong> For each $n \in \mathbb{N}$, the Euclidean space $\R^n$ is a smooth $n$-manifold with a single chart $\phi := \text{Id}_{\R^n}$, the identity map, for all $p \in \M$. Thus, $\phi$ is a <em>global coordinate chart</em>.</p>
<p class="right">//</p>
<p><strong>Example 2 (Spaces of matrices).</strong> Let $\text{M}(m \times n, \R)$ denote the set of $m \times n$ matrices with real entries. We can identify it with $\R^{mn}$ and as before, this is a smooth $mn$-dimensional manifold. Some of its subsets, e.g. the general linear group $\text{GL}(n, \R)$ and the space of full rank matrices, are smooth manifolds.</p>
<p class="right">//</p>
<p><strong>Remark 1.</strong> We will drop $n$ when referring to a smooth $n$-manifold from now on, for brevity’s sake. Furthermore, we will start to use the <strong><em>Einstein summation convention</em></strong>: repeated indices above and below are implied to be summed, e.g. $v_i w^i := \sum_i v_i w^i$.</p>
<p class="right">//</p>
<h2 class="section-heading">Tangent vectors and covectors</h2>
<p>At each point $p \in \M$, there exists a vector space $T_p \M$, called the <strong><em>tangent space</em></strong> at $p$. An element $v \in T_p \M$ is called a <strong><em>tangent vector</em></strong>. Let $f: \M \to \R$ be a smooth function. In local coordinates $\{x^1, \dots, x^n\}$ defined around $p$, the coordinate vectors $\{ \partial/\partial x^1, \dots, \partial/\partial x^n \}$ form a <strong><em>coordinate basis</em></strong> for $T_p \M$.</p>
<p>A tangent vector $v \in T_p \M$ can also be seen as a <strong><em>derivation</em></strong>, a linear map $C^\infty(\M) \to \R$ that follows Leibniz rule (product rule of derivative), i.e.</p>
<script type="math/tex; mode=display">v(fg) = f(p)vg + g(p)vf \enspace \enspace \forall f, g \in C^\infty(\M) \, .</script>
<p>Thus, we can also see $T_p \M$ to be the set of all derivations of $C^\infty(\M)$ at $p$.</p>
<p>For each $p \in \M$ there also exists the dual space $T_p^* \M$ of $T_p \M$, called the <strong><em>cotangent space</em></strong> at $p$. Each element $\omega \in T_p^* \M$ is called a <strong><em>tangent covector</em></strong>, which is a linear functional $\omega: T_p \M \to \R$ acting on tangent vectors at $p$. Given the same local coordinates as above, the basis for the cotangent space at $p$ is called the <strong><em>dual coordinate basis</em></strong> and is given by $\{ dx^1, \dots, dx^n \}$, where $dx^i(\partial/\partial x^j) = \delta^i_j$, the Kronecker delta. Note that this implies that if $v := v^i \, \partial/\partial x^i$, then $dx^i(v) = v^i$.</p>
<p>Tangent vectors and covectors follow different transformation rules. We say an object with a lower index, e.g. the components $\omega_i$ of a tangent covector and the coordinate basis $\partial/\partial x^i =: \partial_i$, follows the <strong><em>covariant</em></strong> transformation rule. Meanwhile, an object with an upper index, e.g. the components $v^i$ of a tangent vector and the dual coordinate basis $dx^i$, follows the <strong><em>contravariant</em></strong> transformation rule. These names stem from how an object transforms w.r.t. a change of coordinates. Recall that when all the basis vectors are scaled up by a factor of $k$, the coefficients of a vector in that basis are scaled by $1/k$; thus a vector is said to transform <em>contra</em>-variantly (the opposite way to the basis). Analogously, the dual basis is then scaled by $1/k$, and a covector’s components are scaled by $k$, i.e. the same way as the basis (<em>co</em>-variantly).</p>
<p>The partial derivatives of a scalar field (real-valued function) on $\M$ can be interpreted as the components of a covector field in a coordinate-independent way. Let $f$ be such a scalar field. We define a covector field $df: \M \to T^* \M$, called the <strong><em>differential</em></strong> of $f$, by</p>
<script type="math/tex; mode=display">df_p(v) := vf \enspace \enspace \text{for} \, v \in T_p\M \, .</script>
<p>Concretely, in smooth coordinates $\{ x^i \}$ around $p$, we can show that it can be written as</p>
<script type="math/tex; mode=display">df_p := \frac{\partial f}{\partial x^i} (p) \, dx^i \, \vert_p \, ,</script>
<p>or as an equation between covector fields instead of covectors:</p>
<script type="math/tex; mode=display">df := \frac{\partial f}{\partial x^i} \, dx^i \, .</script>
<p>The disjoint union of the tangent spaces at all points of $\M$ is called the <strong><em>tangent bundle</em></strong> of $\M$</p>
<script type="math/tex; mode=display">TM := \coprod_{p \in \M} T_p \M \, .</script>
<p>Meanwhile, analogously for the cotangent spaces, we define the <strong><em>cotangent bundle</em></strong> of $\M$ as</p>
<script type="math/tex; mode=display">T^*M := \coprod_{p \in \M} T^*_p \M \, .</script>
<p>If $\M$ and $\mathcal{N}$ are smooth manifolds and $F: \M \to \mathcal{N}$ is a smooth map, for each $p \in \M$ we define a map</p>
<script type="math/tex; mode=display">dF_p : T_p \M \to T_{F(p)} \mathcal{N} \, ,</script>
<p>called the <strong><em>differential</em></strong> of $F$ at $p$, as follows. Given $v \in T_p \M$:</p>
<script type="math/tex; mode=display">dF_p (v)(f) := v(f \circ F) \, .</script>
<p>Moreover, for any $v \in T_p \M$, we call $dF_p (v)$ the <strong><em>pushforward</em></strong> of $v$ by $F$ at $p$. It differs from the previous definition of differential in the sense that this map is a linear map between tangent spaces of two manifolds. Furthermore the differential of $F$ can be seen as the generalization of the total derivative in Euclidean spaces, in which $dF_p$ is represented by the Jacobian matrix.</p>
<h2 class="section-heading">Vector fields</h2>
<p>If $\M$ is a smooth $n$-manifold, a <strong><em>vector field</em></strong> on $\M$ is a continuous map $X: \M \to T\M$, written as $p \mapsto X_p$, such that $X_p \in T_p \M$ for each $p \in \M$. If $(U, (x^i))$ is any smooth chart for $\M$, we write the value of $X$ at any $p \in U \subset \M$ as</p>
<script type="math/tex; mode=display">X_p = X^i(p) \, \frac{\partial}{\partial x^i} \vert_p \, .</script>
<p>This defines $n$ functions $X^i: U \to \R$, called the <strong><em>component functions</em></strong> of $X$. The restriction of $X$ to $U$ is a smooth vector field if and only if its component functions w.r.t. the chart are smooth.</p>
<p><strong>Example 3 (Coordinate vector fields).</strong> If $(U, (x^i))$ is any smooth chart on $\M$, then $p \mapsto \partial/\partial x^i \vert_p$ is a vector field on $U$, called the <strong><em>i-th coordinate vector field</em></strong>. It is smooth as its component functions are constant. These vector fields define a basis of the tangent space at each point.</p>
<p class="right">//</p>
<p><strong>Example 4 (Gradient).</strong> If $f \in C^\infty(\M)$ is a real-valued function on $\M$, then the gradient of $f$ is a vector field on $\M$. See the corresponding section below for more detail.</p>
<p class="right">//</p>
<p>We denote $\mathfrak{X}(\M)$ to be the set of all smooth vector fields on $\M$. It is a vector space under pointwise addition and scalar multiplication, i.e. $(aX + bY)_p = aX_p + bY_p$. The zero element is the zero vector field, whose value is $0 \in T_p \M$ for all $p \in \M$. If $f \in C^\infty(\M)$ and $X \in \mathfrak{X}(\M)$, then we define $fX: \M \to T\M$ by $(fX)_p = f(p)X_p$. Note that this defines a multiplication of a vector field with a smooth real-valued function. Furthermore, if in addition, $g \in C^\infty(\M)$ and $Y \in \mathfrak{X}(\M)$, then $fX + gY$ is also a smooth vector field.</p>
<p>A <strong><em>local frame</em></strong> for $\M$ is an ordered $n$-tuple of vector fields $(E_1, \dots, E_n)$ defined on an open subset $U \subseteq \M$ such that $(E_1 \vert_p, \dots, E_n \vert_p)$ forms a basis for $T_p \M$ for each $p \in U$. It is called a <strong><em>global frame</em></strong> if $U = \M$, and a <strong><em>smooth frame</em></strong> if each $E_i$ is smooth.</p>
<p>If $X \in \mathfrak{X}(\M)$ and $f \in C^\infty(U)$, we define $Xf: U \to \R$ by $(Xf)(p) = X_p f$. $X$ also defines a map $C^\infty(\M) \to C^\infty(\M)$ by $f \mapsto Xf$ which is linear and Leibniz, thus it is a derivation. Moreover, derivations of $C^\infty(\M)$ can be identified with smooth vector fields, i.e. $D: C^\infty(\M) \to C^\infty(\M)$ is a derivation if and only if it is of the form $Df = Xf$ for some $X \in \mathfrak{X}(\M)$.</p>
<h2 class="section-heading">Tensors</h2>
<p>Let $\{ V_k \}$ and $U$ be real vector spaces. A map $F: V_1 \times \dots \times V_k \to U$ is said to be <strong><em>multilinear</em></strong> if it is linear as a function of each variable separately when the others are held fixed. That is, it is a generalization of the familiar linear and bilinear maps. Furthermore, we write the vector space of all multilinear maps $ V_1 \times \dots \times V_k \to U $ as $ \text{L}(V_1, \dots, V_k; U) $.</p>
<p><strong>Example 5 (Multilinear functions).</strong> Some examples of familiar multilinear functions are</p>
<ol>
<li>The <em>dot product</em> in $ \R^n $ is a scalar-valued bilinear function of two vectors. E.g. for any $ v, w \in \R^n $, the dot product between them is $ v \cdot w := \sum_i^n v^i w^i $, which is linear in both $ v $ and $ w $.</li>
<li>The <em>determinant</em> is a real-valued multilinear function of $ n $ vectors in $ \R^n $.</li>
</ol>
<p class="right">//</p>
<p>Let $\{ W_l \}$ also be real vector spaces and let</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
F&: V_1 \times \dots \times V_k \to \R \\
G&: W_1 \times \dots \times W_l \to \R
\end{align} %]]></script>
<p>be multilinear maps. Define a function</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
F \otimes G &: V_1 \times \dots \times V_k \times W_1 \times \dots \times W_l \to \R \\
F \otimes G &(v_1, \dots, v_k, w_1, \dots, w_l) = F(v_1, \dots, v_k) G(w_1, \dots, w_l) \, .
\end{align} %]]></script>
<p>From the multilinearity of $ F $ and $ G $ it follows that $ F \otimes G $ is also multilinear; it is called the <strong><em>tensor product of $ F $ and $ G $</em></strong>. That is, tensors and tensor products are multilinear maps with codomain $ \R $.</p>
<p><strong>Example 6 (Tensor products of covectors).</strong> Let $ V $ be a vector space and $ \omega, \eta \in V^* $. Recall that they are both linear maps from $ V $ to $ \R $. Therefore the tensor product between them is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\omega \otimes \eta &: V \times V \to \R \\
\omega \otimes \eta &(v_1, v_2) = \omega(v_1) \eta(v_2) \, .
\end{align} %]]></script>
<p class="right">//</p>
<p><strong>Example 7 (Tensor products of dual basis).</strong> Let $ \epsilon^1, \epsilon^2 $ be the standard dual basis for $ (\R^2)^* $. Then, the tensor product $ \epsilon^1 \otimes \epsilon^2: \R^2 \times \R^2 \to \R $ is the bilinear function defined by</p>
<script type="math/tex; mode=display">\epsilon^1 \otimes \epsilon^2((w, x), (y, z)) := \epsilon^1(w, x) \, \epsilon^2(y, z) = wz \, .</script>
<p class="right">//</p>
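<p>As a concrete illustration of Example 7, the tensor product of the two dual basis covectors can be sketched in a few lines of NumPy. This is a toy illustration; the function names are ours, not from any library.</p>

```python
import numpy as np

# Dual basis covectors on R^2: eps1 picks out the first component, eps2 the second.
eps1 = lambda v: v[0]
eps2 = lambda v: v[1]

def tensor_product(omega, eta):
    """Tensor product of two covectors: (omega ⊗ eta)(v, w) = omega(v) * eta(w)."""
    return lambda v, w: omega(v) * eta(w)

t = tensor_product(eps1, eps2)
v, w = np.array([3.0, 5.0]), np.array([2.0, 7.0])  # v = (w, x), w = (y, z) in the example
print(t(v, w))  # picks out v[0] * w[1] = 3 * 7 = 21.0
```

<p>Bilinearity of $ \epsilon^1 \otimes \epsilon^2 $ is inherited directly from the linearity of each factor, which is easy to spot in the one-line product above.</p>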
<p>We use the notation $ V_1^* \otimes \dots \otimes V_k^* $ to denote the space $ \text{L}(V_1, \dots, V_k; \R) $. Let $ V $ be a finite-dimensional vector space. If $ k \in \mathbb{N} $, a <strong><em>covariant</em> $ k $-tensor on $ V $</strong> is an element of the $ k $-fold tensor product $ V^* \otimes \dots \otimes V^* $, i.e. a real-valued multilinear function of $ k $ elements of $ V $. The number $ k $ is called the <strong><em>rank</em></strong> of the tensor.</p>
<p>Analogously, we define a <strong><em>contravariant $ k $-tensor on $ V $</em></strong> to be an element of the $ k $-fold tensor product $ V \otimes \dots \otimes V $. We can mix the two types of tensors together: For any $ k, l \in \mathbb{N} $, we define a <strong><em>mixed tensor on $ V $ of type $ (k, l) $</em></strong> to be an element of the tensor product of $ k $ copies of $ V $ and $ l $ copies of $ V^* $.</p>
<h2 class="section-heading">Riemannian metrics</h2>
<p>So far we have no mechanism to measure the length of (tangent) vectors like we do in standard Euclidean geometry, where the length of a vector $v$ is measured in terms of the dot product as $ \sqrt{v \cdot v} $. Thus, we would like to add a structure to our smooth manifold $\M$ that enables us to do just that.</p>
<p>A <strong><em>Riemannian metric</em></strong> $ g $ on $ \M $ is a smooth symmetric covariant 2-tensor field on $ \M $ that is positive definite at each point. Furthermore, for each $ p \in \M $, $ g_p $ defines an inner product on $ T_p \M $, written $ \inner{v, w}_g = g_p(v, w) $ for all $ v, w \in T_p \M $. We call the pair $(\M, g)$ a <strong><em>Riemannian manifold</em></strong>.</p>
<p>In any smooth local coordinates $\{x^i\}$, a Riemannian metric can be written as the tensor product</p>
<script type="math/tex; mode=display">g = g_{ij} \, dx^i \otimes dx^j \, ,</script>
<p>such that</p>
<script type="math/tex; mode=display">g(v, w) = g_{ij} \, dx^i \otimes dx^j(v, w) = g_{ij} \, dx^i(v) dx^j(w) = g_{ij} \, v^i w^j \, .</script>
<p>That is, we can represent $ g $ as a symmetric, positive definite matrix $ G $ taking two tangent vectors as its arguments: $ \inner{v, w}_g = v^\text{T} G w $. Furthermore, we can define the norm w.r.t. $g$ by $\norm{v}_g := \sqrt{\inner{v, v}_g}$ for any $v \in T_p \M$.</p>
<p><strong>Example 8 (The Euclidean Metric).</strong> The simplest example of a Riemannian metric is the familiar <strong><em>Euclidean metric</em></strong> $g$ of $\R^n$ using the standard coordinate. It is defined by</p>
<script type="math/tex; mode=display">g = \delta_{ij} \, dx^i \otimes dx^j \, ,</script>
<p>which, if applied to vectors $v, w \in T_p \R^n$, yields</p>
<script type="math/tex; mode=display">g_p(v, w) = \delta_{ij} \, v^i w^j = \sum_{i=1}^n v^i w^i = v \cdot w \, .</script>
<p>Note that above, $\delta_{ij}$ is the Kronecker delta. Thus, the Euclidean metric can be represented by the $n \times n$ identity matrix.</p>
<p class="right">//</p>
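<p>The matrix view of a metric lends itself to a short numerical sketch. Assuming the metric is given as a symmetric positive definite matrix $G$ in some fixed coordinate basis, the inner product and norm are just matrix products (a minimal illustration, not library code):</p>

```python
import numpy as np

def metric_inner(G, v, w):
    """Inner product <v, w>_g = v^T G w for a metric represented by an SPD matrix G."""
    return v @ G @ w

def metric_norm(G, v):
    """Norm ||v||_g = sqrt(<v, v>_g)."""
    return np.sqrt(metric_inner(G, v, v))

v, w = np.array([1.0, 2.0]), np.array([3.0, -1.0])

# Euclidean metric: G is the identity, so <v, w>_g reduces to the dot product.
G_euc = np.eye(2)
print(metric_inner(G_euc, v, w))   # 1*3 + 2*(-1) = 1.0

# A non-Euclidean (but still SPD) metric reweights directions anisotropically.
G = np.array([[2.0, 0.0], [0.0, 0.5]])
print(metric_inner(G, v, w))       # 2*1*3 + 0.5*2*(-1) = 5.0
print(metric_norm(G, v))           # sqrt(2*1 + 0.5*4) = 2.0
```

<p>With $G$ the identity matrix this recovers Example 8 exactly: the Euclidean metric is the special case where the matrix representation is the identity.</p>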
<h2 class="section-heading">The tangent-cotangent isomorphism</h2>
<p>Riemannian metrics also provide an isomorphism between the tangent and cotangent space: They allow us to convert vectors to covectors and vice versa. Let $(\M, g)$ be a Riemannian manifold. We define an isomorphism $\hat{g}: T_p \M \to T_p^* \M$ as follows. For each $p \in \M$ and each $v \in T_p \M$</p>
<script type="math/tex; mode=display">\hat{g}(v) = \inner{v, \cdot}_g \, .</script>
<p>Notice that $\hat{g}(v)$ is in $T_p^* \M$ as it is a linear functional over $T_p \M$. In any smooth coordinates $\{x^i\}$, by definition we can write $g = g_{ij} \, dx^i \otimes dx^j$. Thus we can write the isomorphism above as</p>
<script type="math/tex; mode=display">\hat{g}(v) = (g_{ij} \, v^i) \, dx^j =: v_j \, dx^j \, .</script>
<p>Notice that we transform a contravariant component $v^i$ (denoted by the upper index $i$) to a covariant component $v_j = g_{ij} \, v^i$ (denoted by the lower index $j$), with the help of the metric tensor $g$. Because of this, we say that we obtain a covector from a tangent vector by <strong><em>lowering an index</em></strong>. Note that we can also denote this with the “flat” symbol from musical notation: $\hat{g}(v) =: v^\flat$.</p>
<p>As a Riemannian metric can be represented by a symmetric positive definite matrix, it is invertible; we denote the components of the inverse matrix by $g^{ij}$, moving the indices to the top, such that $g^{ij} \, g_{jk} = g_{kj} \, g^{ji} = \delta^i_k$. We can then define the inverse of the above isomorphism as $\hat{g}^{-1}: T_p^* \M \to T_p \M$, where</p>
<script type="math/tex; mode=display">\hat{g}^{-1}(\omega) = (g^{ij} \, \omega_j) \, \frac{\partial}{\partial x^i} =: \omega^i \, \frac{\partial}{\partial x^i} \, ,</script>
<p>for all $\omega \in T_p^* \M$. In correspondence with the previous operation, we are now looking at the components $\omega^i := g^{ij} \, \omega_j$, hence this operation is called <strong><em>raising an index</em></strong>, which we can also denote with the “sharp” musical symbol: $\hat{g}^{-1}(\omega) =: \omega^\sharp$. Putting these two maps together, we call the isomorphism between the tangent and cotangent space the <strong><em>musical isomorphism</em></strong>.</p>
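<p>In coordinates, lowering and raising indices amount to multiplication by the metric matrix and its inverse. A minimal sketch, assuming the metric is given as a matrix in a fixed coordinate basis (here the polar-coordinate metric $\mathrm{diag}(1, r^2)$ on the plane, evaluated at $r = 2$):</p>

```python
import numpy as np

# Metric for polar coordinates (r, theta) on the plane, at a point with r = 2.
G = np.diag([1.0, 4.0])
G_inv = np.linalg.inv(G)

def flat(G, v):
    """Lower an index: v_j = g_{ij} v^i, sending a tangent vector to a covector."""
    return G @ v

def sharp(G_inv, omega):
    """Raise an index: w^i = g^{ij} w_j, sending a covector to a tangent vector."""
    return G_inv @ omega

v = np.array([3.0, 0.5])        # components of a tangent vector
omega = flat(G, v)              # components of v^flat
print(omega)                    # [3. 2.]

# sharp inverts flat: (v^flat)^sharp = v
print(sharp(G_inv, omega))      # [3.  0.5]
```

<p>The round trip $\sharp \circ \flat = \mathrm{id}$ is exactly the statement that $\hat{g}$ is an isomorphism.</p>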
<h2 class="section-heading">The Riemannian gradient</h2>
<p>Let $(\M, g)$ be a Riemannian manifold, and let $f: \M \to \R$ be a real-valued function over $\M$ (i.e. a scalar field on $\M)$. Recall that $df$ is a covector field, which in coordinates has partial derivatives of $f$ as its components. We define a vector field called the <strong><em>gradient</em></strong> of $f$ by</p>
<script type="math/tex; mode=display">\begin{align}
\grad{f} := (df)^\sharp = \hat{g}^{-1}(df) \, .
\end{align}</script>
<p>For any $p \in \M$ and for any $v \in T_p \M$, the gradient satisfies</p>
<script type="math/tex; mode=display">\inner{\grad{f}, v}_g = vf \, .</script>
<p>That is, for each $p \in \M$ and for any $v \in T_p \M$, $\grad{f}$ is a vector in $T_p \M$ such that the inner product with $v$ is the derivation of $f$ by $v$. Observe the compatibility of this definition with standard multi-variable calculus: the directional derivative of a function in the direction of a vector is the dot product of its gradient and that vector.</p>
<p>In any smooth coordinate $\{x^i\}$, $\grad{f}$ has the expression</p>
<script type="math/tex; mode=display">\grad{f} = g^{ij} \frac{\partial f}{\partial x^i} \frac{\partial}{\partial x^j} \, .</script>
<p><strong>Example 9 (Euclidean gradient).</strong> On $\R^n$ with the Euclidean metric with the standard coordinate, the gradient of $f: \R^n \to \R$ is</p>
<script type="math/tex; mode=display">\grad{f} = \delta^{ij} \, \frac{\partial f}{\partial x^i} \frac{\partial}{\partial x^j} = \sum_{i=1}^n \frac{\partial f}{\partial x^i} \frac{\partial}{\partial x^i} \, .</script>
<p>Thus it again coincides with the definition we are familiar with from calculus.</p>
<p class="right">//</p>
<p>All in all, given a basis and in matrix notation: let $G$ be the matrix representation of $g$ and let $d$ be the matrix representation of $df$ (i.e. a row vector containing all partial derivatives of $f$); then $\grad{f} = G^{-1} d^\T$.</p>
<p>The interpretation of the gradient on a Riemannian manifold is analogous to the one in Euclidean space: its direction is the direction of steepest ascent of $f$ and it is orthogonal to the level sets of $f$; its length is the maximum directional derivative of $f$ in any direction.</p>
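<p>The matrix formula $\grad{f} = G^{-1} d^\T$ can be checked numerically. A sketch under stated assumptions: polar coordinates on the plane with metric $\mathrm{diag}(1, r^2)$, and $f(r, \theta) = r \sin\theta$, i.e. the Cartesian coordinate $y$, whose Euclidean gradient has unit length everywhere:</p>

```python
import numpy as np

def riemannian_grad(G, df):
    """grad f = G^{-1} d^T, where df holds the partial derivatives of f (a covector)."""
    return np.linalg.solve(G, df)

# Polar coordinates (r, theta) with metric g = diag(1, r^2); f(r, theta) = r sin(theta).
r, theta = 2.0, np.pi / 6
G = np.diag([1.0, r**2])
df = np.array([np.sin(theta), r * np.cos(theta)])  # (df/dr, df/dtheta)

grad = riemannian_grad(G, df)
print(grad)            # components of grad f in the polar frame

# Sanity check: ||grad f||_g^2 = <grad f, grad f>_g should equal 1, since f is
# the Cartesian y-coordinate and |grad y| = 1 in the Euclidean plane.
print(grad @ G @ grad)  # 1.0 (up to floating point)
```

<p>Note that applying $G^{-1}$ via <code>np.linalg.solve</code> rather than forming the inverse explicitly is the usual numerically preferable choice.</p>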
<h2 class="section-heading">Connections</h2>
<p>Let $(\M, g)$ be a Riemannian manifold and let $X, Y: \M \to T \M$ be vector fields. Applying the usual definition of the directional derivative, a natural attempt at differentiating $Y$ in the direction $X$ is</p>
<script type="math/tex; mode=display">D_X \vert_p Y = \lim_{h \to 0} \frac{Y_{p+hX_p} - Y_p}{h} \, .</script>
<p>However, we will have problems: We have not defined what the expression $p+hX_p$ means. Furthermore, as $Y_{p+hX_p}$ and $Y_p$ live in the different vector spaces $T_{p+hX_p} \M$ and $T_p \M$, it does not make sense to subtract them, unless there is a natural way to identify the tangent spaces at different points with one another, as in Euclidean spaces. Hence, we need to add an additional structure, called a <strong><em>connection</em></strong>, that allows us to compare tangent vectors from different tangent spaces at nearby points.</p>
<p>Specifically, we define an <strong><em>affine connection</em></strong> to be a connection in the tangent bundle of $\M$. Let $\mathfrak{X}(\M)$ be the space of vector fields on $\M$; $X, Y, Z \in \mathfrak{X}(\M)$; $f, g \in C^\infty(\M)$; and $a, b \in \R$. An affine connection is a map</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla: \mathfrak{X}(\M) \times \mathfrak{X}(\M) &\to \mathfrak{X}(\M) \\
(X, Y) &\mapsto \nabla_X Y \, ,
\end{align} %]]></script>
<p>which satisfies the following properties</p>
<ol>
<li>$C^\infty(\M)$-linearity in $X$, i.e., $\nabla_{fX+gY} Z = f \, \nabla_X Z + g \, \nabla_Y Z$</li>
<li>$\R$-linearity in $Y$, i.e., $\nabla_X (aY + bZ) = a \, \nabla_X Y + b \, \nabla_X Z$</li>
<li>Leibniz rule, i.e., $\nabla_X (fY) = (Xf) Y + f \, \nabla_X Y$ .</li>
</ol>
<p>We call $\nabla_X Y$ the <strong><em>covariant derivative</em></strong> of $Y$ in the direction $X$. Note that the notation $Xf$ means $Xf(p) := D_{X_p} \vert_p f$ for all $p \in \M$, i.e. the directional derivative (it is a scalar field). Furthermore, notice that covariant derivative and connection are two names for the same object; it generalizes the notion of directional derivative to vector fields.</p>
<p>In any smooth local frame $(E_i)$ for $T \M$ on an open subset $U \subseteq \M$, we can expand the vector field $\nabla_{E_i} E_j$ in terms of this frame</p>
<script type="math/tex; mode=display">\nabla_{E_i} E_j = \Gamma^k_{ij} E_k \,.</script>
<p>The $n^3$ smooth functions $\Gamma^k_{ij}: U \to \R$ are called the <strong><em>connection coefficients</em></strong> or the <strong><em>Christoffel symbols</em></strong> of $\nabla$.</p>
<p><strong>Example 10 (Covariant derivative in Euclidean spaces).</strong> Let $\R^n$ with the Euclidean metric be a Riemannian manifold. Then</p>
<script type="math/tex; mode=display">(\nabla_X Y)_p = \lim_{h \to 0} \frac{Y_{p+hX_p} - Y_p}{h} \enspace \enspace \forall p \in \M \, ,</script>
<p>the usual directional derivative, is a covariant derivative.</p>
<p class="right">//</p>
<p>There exists a unique affine connection for every Riemannian manifold $(\M, g)$ that satisfies</p>
<ol>
<li>Symmetry, i.e., $\nabla_X Y - \nabla_Y X = [X, Y]$</li>
<li>Metric compatibility, i.e., $Z \inner{X, Y}_g = \inner{\nabla_Z X, Y}_g + \inner{X, \nabla_Z Y}_g$,</li>
</ol>
<p>for all $X, Y, Z \in \mathfrak{X}(\M)$. It is called the <strong><em>Levi-Civita connection</em></strong>. Note that, $[\cdot, \cdot]$ is the <strong>Lie bracket</strong>, defined by $[X, Y]f = X(Yf) - Y(Xf)$ for all $f \in C^\infty(\M)$. Note also that, the connection shown in Example 10 is the Levi-Civita connection for Euclidean spaces with the Euclidean metric.</p>
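<p>The Christoffel symbols of the Levi-Civita connection can be computed directly from the metric via the standard coordinate formula $\Gamma^k_{ij} = \tfrac{1}{2} g^{kl} (\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij})$, which follows from the two properties above but is not derived in this post. A numerical sketch using central finite differences, checked against the polar-coordinate metric $\mathrm{diag}(1, r^2)$ whose nonzero symbols are $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = 1/r$:</p>

```python
import numpy as np

def christoffel(metric, x, eps=1e-6):
    """Christoffel symbols Gamma[k, i, j] of the Levi-Civita connection at point x,
    using the standard formula
        Gamma^k_ij = 1/2 g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij}),
    with metric derivatives approximated by central finite differences."""
    n = len(x)
    dg = np.zeros((n, n, n))  # dg[l, i, j] = d g_{ij} / d x^l
    for l in range(n):
        h = np.zeros(n); h[l] = eps
        dg[l] = (metric(x + h) - metric(x - h)) / (2 * eps)
    g_inv = np.linalg.inv(metric(x))
    # T[i, j, l] = d_i g_{jl} + d_j g_{il} - d_l g_{ij}
    T = dg + dg.transpose(1, 0, 2) - dg.transpose(1, 2, 0)
    return 0.5 * np.einsum('kl,ijl->kij', g_inv, T)

# Polar metric g = diag(1, r^2) at the point (r, theta) = (2, 0.7).
metric = lambda x: np.diag([1.0, x[0]**2])
Gamma = christoffel(metric, np.array([2.0, 0.7]))
print(Gamma[0, 1, 1])  # Gamma^r_{theta theta} = -r   -> -2.0
print(Gamma[1, 0, 1])  # Gamma^theta_{r theta} = 1/r  ->  0.5
```

<p>Symmetry of the connection shows up here as the symmetry $\Gamma^k_{ij} = \Gamma^k_{ji}$ of the computed array in its lower indices.</p>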
<h2 class="section-heading">Riemannian Hessians</h2>
<p>Let $(\M, g)$ be a Riemannian manifold equipped with the Levi-Civita connection $\nabla$. Given a scalar field $f: \M \to \R$ and any $X, Y \in \mathfrak{X}(\M)$, the <strong><em>Riemannian Hessian</em></strong> of $f$ is the covariant 2-tensor field $\text{Hess} \, f := \nabla^2 f := \nabla \nabla f$, defined by</p>
<script type="math/tex; mode=display">\text{Hess} \, f(X, Y) := X(Yf) - (\nabla_X Y)f = \inner{\nabla_X \, \grad{f}, Y}_g \, .</script>
<p>Another way to define the Riemannian Hessian is to treat it as a linear map $T_p \M \to T_p \M$, defined by</p>
<script type="math/tex; mode=display">\text{Hess}_{v} \, f = \nabla_v \, \grad{f} \, ,</script>
<p>for every $p \in \M$ and $v \in T_p \M$.</p>
<p>In any local coordinate $\{x^i\}$, it is defined by</p>
<script type="math/tex; mode=display">\text{Hess} \, f = f_{;ij} \, dx^i \otimes dx^j := \left( \frac{\partial^2 f}{\partial x^i \partial x^j} - \Gamma^k_{ij} \frac{\partial f}{\partial x^k} \right) \, dx^i \otimes dx^j \, .</script>
<p><strong>Example 11 (Euclidean Hessian).</strong> Let $\R^n$ be a Euclidean space with the Euclidean metric and the standard Euclidean coordinates. We can show that the connection coefficients of the Levi-Civita connection are all $0$. Thus the Hessian is given by</p>
<script type="math/tex; mode=display">\text{Hess} \, f = \left( \frac{\partial^2 f}{\partial x^i \partial x^j} \right) \, dx^i \otimes dx^j \, .</script>
<p>It is the same Hessian as we have seen in calculus.</p>
<p class="right">//</p>
<h2 class="section-heading">Geodesics</h2>
<p>Let $(\M, g)$ be a Riemannian manifold and let $\nabla$ be a connection on $T\M$. Given a smooth curve $\gamma: I \to \M$, a <strong><em>vector field along $\gamma$</em></strong> is a smooth map $V: I \to T\M$ such that $V(t) \in T_{\gamma(t)}\M$ for every $t \in I$. We denote the space of all such vector fields $\mathfrak{X}(\gamma)$. A vector field $V$ along $\gamma$ is said to be <strong><em>extendible</em></strong> if there exists another vector field $\tilde{V}$ on a neighborhood of $\gamma(I)$ such that $V = \tilde{V} \circ \gamma$.</p>
<p>For each smooth curve $\gamma: I \to \M$, the connection determines a unique operator</p>
<script type="math/tex; mode=display">D_t: \mathfrak{X}(\gamma) \to \mathfrak{X}(\gamma) \, ,</script>
<p>called the <strong><em>covariant derivative along $\gamma$</em></strong>, satisfying (i) linearity over $\R$, (ii) the Leibniz rule, and (iii) if $V \in \mathfrak{X}(\gamma)$ is extendible, then for every extension $\tilde{V}$ of $V$, we have $ D_t V(t) = \nabla_{\gamma'(t)} \tilde{V}$.</p>
<p>For every smooth curve $\gamma: I \to \M$, we define the <strong><em>acceleration</em></strong> of $\gamma$ to be the vector field $D_t \gamma'$ along $\gamma$. A smooth curve $\gamma$ is called a <strong><em>geodesic</em></strong> with respect to $\nabla$ if its acceleration is zero, i.e. $D_t \gamma' = 0 \enspace \forall t \in I$. In terms of smooth coordinates $\{x^i\}$, if we write $\gamma$ in terms of its component functions $\gamma(t) := (x^1(t), \dots, x^n(t))$, then $\gamma$ is a geodesic if and only if its component functions satisfy the following <strong><em>geodesic equation</em></strong>:</p>
<script type="math/tex; mode=display">\ddot{x}^k(t) + \dot{x}^i(t) \, \dot{x}^j(t) \, \Gamma^k_{ij}(x(t)) = 0 \, ,</script>
<p>where we use $x(t)$ as an abbreviation for $(x^1(t), \dots, x^n(t))$. Observe that this tells us how to compute a geodesic: we need to solve a system of second-order ODEs for the real-valued functions $x^1, \dots, x^n$.</p>
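<p>The geodesic equation can be integrated numerically like any second-order ODE. A sketch for polar coordinates $(r, \theta)$ on the flat plane, whose nonzero Christoffel symbols are $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = 1/r$ (standard values, hard-coded here): since the plane is flat, the computed geodesic must be a straight line in Cartesian coordinates.</p>

```python
import numpy as np

def geodesic_rhs(state):
    """Right-hand side of the geodesic equation in polar coordinates (r, theta),
    written as a first-order system for (r, theta, dr/dt, dtheta/dt)."""
    r, th, dr, dth = state
    ddr = r * dth**2              # -Gamma^r_{theta theta} * dtheta * dtheta
    ddth = -2.0 * dr * dth / r    # -2 * Gamma^theta_{r theta} * dr * dtheta
    return np.array([dr, dth, ddr, ddth])

def integrate(state, t_end, n_steps=1000):
    """Classic RK4 integration of the geodesic equation."""
    h = t_end / n_steps
    for _ in range(n_steps):
        k1 = geodesic_rhs(state)
        k2 = geodesic_rhs(state + 0.5 * h * k1)
        k3 = geodesic_rhs(state + 0.5 * h * k2)
        k4 = geodesic_rhs(state + h * k3)
        state = state + (h / 6.0) * (k1 + 2*k2 + 2*k3 + k4)
    return state

# Start at Cartesian (1, 0) moving with velocity (0, 1): in polar coordinates,
# r = 1, theta = 0, dr/dt = 0, dtheta/dt = 1. The geodesic is the line x = 1, y = t.
r, th, _, _ = integrate(np.array([1.0, 0.0, 0.0, 1.0]), t_end=1.0)
x, y = r * np.cos(th), r * np.sin(th)
print(x, y)   # ≈ (1.0, 1.0)
```

<p>At $t = 1$ the curve should reach the Cartesian point $(1, 1)$, which the integrator reproduces to high accuracy.</p>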
<p>Suppose $\gamma: [a, b] \to \M$ is a smooth curve segment defined on the interval $[a, b]$. The <strong><em>length</em></strong> of $\gamma$ is</p>
<script type="math/tex; mode=display">L_g (\gamma) := \int_a^b \norm{\gamma'(t)}_g \, dt \, ,</script>
<p>where $\gamma'$ is the derivative (the velocity vector) of $\gamma$. We can then use curve segments as “measuring tapes” to measure the <strong><em>Riemannian distance</em></strong> from $p$ to $q$ for any $p, q \in \M$:</p>
<script type="math/tex; mode=display">d_g(p, q) := \inf \, \{L_g(\gamma) \, \vert \, \gamma: [a, b] \to \M \enspace \text{s.t.} \enspace \gamma(a) = p, \, \gamma(b) = q\} \, ,</script>
<p>over all curve segments $\gamma$ which have endpoints at $p$ and $q$. We call a particular $\gamma$ such that $L_g(\gamma) = d_g(p, q)$ a <strong><em>length-minimizing curve</em></strong>. We can show that all geodesics are locally length-minimizing, and all length-minimizing curves are geodesics.</p>
<h2 class="section-heading">Parallel transport</h2>
<p>Let $(\M, g)$ be a Riemannian manifold with affine connection $\nabla$. A smooth vector field $V$ along a smooth curve $\gamma: I \to \M$ is said to be <strong><em>parallel</em></strong> along $\gamma$ if $D_t V = 0$ for all $t \in I$. Notice that a geodesic can then be said to be a curve whose velocity vector field is parallel along the curve.</p>
<p>Given $t_0 \in I$ and $v \in T_{\gamma(t_0)} \M$, we can show there exists a unique parallel vector field $V$ along $\gamma$ such that $V(t_0) = v$. This vector field is called the <strong><em>parallel transport</em></strong> of $v$ along $\gamma$. Now, for each $t_0, t_1 \in I$, we define a map</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
&P^\gamma_{t_0 t_1} : T_{\gamma(t_0)} \M \to T_{\gamma(t_1)} \M \\
&P^\gamma_{t_0 t_1}(v) = V(t_1) \, ,
\end{align} %]]></script>
<p>called the <strong><em>parallel transport map</em></strong>. We can picture the concept of parallel transport by imagining that we are “sliding” a tangent vector $v$ along $\gamma$ such that the direction and the magnitude of $v$ are preserved.</p>
<p>Note that, the parallel transport map is a linear map with inverse $P^\gamma_{t_1 t_0}$, hence it is an isomorphism between two tangent spaces $T_{\gamma(t_0)} \M$ and $T_{\gamma(t_1)} \M$. We can therefore determine the covariant derivative along $\gamma$ using parallel transport:</p>
<script type="math/tex; mode=display">D_t V(t_0) = \lim_{t_1 \to t_0} \frac{P^\gamma_{t_1 t_0} \, V(t_1) - V(t_0)}{t_1 - t_0} \, .</script>
<p>Moreover, we can also determine the connection $\nabla$ via parallel transport:</p>
<script type="math/tex; mode=display">\nabla_X Y \, \vert_p = \lim_{h \to 0} \frac{P^\gamma_{h 0} Y_{\gamma(h)} - Y_p}{h} \, ,</script>
<p>for every $p \in \M$.</p>
<p>Finally, if $A$ is a smooth vector field on $\M$, then $A$ is parallel on $\M$ if and only if $\nabla A = 0$.</p>
<h2 class="section-heading">The exponential map</h2>
<p>Geodesics with proportional initial velocities are related in a simple way. Let $(\M, g)$ be a Riemannian manifold equipped with the Levi-Civita connection. For every $p \in \M$, $v \in T_p \M$, and $c, t \in \R$,</p>
<script type="math/tex; mode=display">\gamma_{cv} (t) = \gamma_{v} (ct) \, ,</script>
<p>whenever either side is defined. This fact is compatible with our intuition on how speed and time are related to distance.</p>
<p>From the fact above, we can define a map from the tangent bundle to $\M$ itself, which sends each line through the origin in $T_p \M$ to a geodesic. Define a subset $\mathcal{E} \subseteq T\M$, the <strong><em>domain of the exponential map</em></strong> by</p>
<script type="math/tex; mode=display">\mathcal{E} := \{ v \in T\M : \gamma_v \text{ is defined on an interval containing } [0, 1] \} \, ,</script>
<p>and then define the <strong><em>exponential map</em></strong></p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
&\text{exp}: \mathcal{E} \to \M \\
&\text{exp}(v) = \gamma_v(1) \, .
\end{align} %]]></script>
<p>For each $p \in \M$, the <strong><em>restricted exponential map</em></strong> at $p$, denoted $\text{exp}_p$ is the restriction of $\text{exp}$ to the set $\mathcal{E}_p := \mathcal{E} \cap T_p \M$.</p>
<p>The interpretation of the (restricted) exponential map is that, given a point $p$ and a tangent vector $v$, we follow the geodesic with $\gamma(0) = p$ and $\gamma'(0) = v$. This can then be seen as a generalization of moving around Euclidean space by following a straight line in the direction of a velocity vector.</p>
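<p>For the unit sphere $S^2$ with the round metric, the restricted exponential map has a closed form, since geodesics are great circles: $\exp_p(v) = \cos(\lVert v \rVert) \, p + \sin(\lVert v \rVert) \, v / \lVert v \rVert$, a standard fact stated here without proof. A minimal sketch:</p>

```python
import numpy as np

def sphere_exp(p, v):
    """Restricted exponential map on the unit sphere S^2 embedded in R^3:
    follow the great circle (geodesic) with initial point p and initial
    velocity v for unit time. Assumes |p| = 1 and v tangent at p (v . p = 0)."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return p  # exp_p(0) = p
    return np.cos(norm_v) * p + np.sin(norm_v) * v / norm_v

north = np.array([0.0, 0.0, 1.0])
v = np.array([np.pi / 2, 0.0, 0.0])   # tangent vector at the north pole, length pi/2

# A quarter great circle: from the north pole down to the equator.
print(sphere_exp(north, v))           # ≈ [1. 0. 0.]
```

<p>Note how the fact $\gamma_{cv}(t) = \gamma_v(ct)$ is visible in the formula: scaling $v$ only changes how far along the same great circle we travel.</p>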
<h2 class="section-heading">Curvature</h2>
<p>Let $(\M, g)$ be a Riemannian manifold. Recall that an <strong><em>isometry</em></strong> is a map that preserves distance. Now, $\M$ is said to be <strong><em>flat</em></strong> if it is locally isometric to a Euclidean space, that is, every point in $\M$ has a neighborhood that is isometric to an open set in $\R^n$. We say that a connection $\nabla$ on $\M$ satisfies the <strong><em>flatness criterion</em></strong> if whenever $X, Y, Z$ are smooth vector fields defined on an open subset of $\M$, the following identity holds:</p>
<script type="math/tex; mode=display">\nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z = \nabla_{[X, Y]} Z \, .</script>
<p>Furthermore, we can show that if $(\M, g)$ is a flat Riemannian manifold, then its Levi-Civita connection satisfies the flatness criterion.</p>
<p><strong>Example 12 (Euclidean space is flat).</strong> Let $\R^n$ with the Euclidean metric be a Riemannian manifold, equipped with the Euclidean connection $\nabla$. Then, given $X, Y, Z$ smooth vector fields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_X \nabla_Y Z &= \nabla_X (Y(Z^k) \partial_k) = XY(Z^k) \partial_k \\
\nabla_Y \nabla_X Z &= \nabla_Y (X(Z^k) \partial_k) = YX(Z^k) \partial_k \, .
\end{align} %]]></script>
<p>The difference between them is</p>
<script type="math/tex; mode=display">(XY(Z^k) - YX(Z^k)) \partial_k = \nabla_{[X, Y]}Z \, ,</script>
<p>by definition. Thus</p>
<script type="math/tex; mode=display">\nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z = \nabla_{[X, Y]}Z \, .</script>
<p>Therefore, the Euclidean space with the Euclidean connection (which is the Levi-Civita connection on Euclidean space) is flat.</p>
<p class="right">//</p>
<p>Based on the above definition of the flatness criterion, we can define a measure of how far a manifold is from being flat:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
&R: \mathfrak{X}(\M) \times \mathfrak{X}(\M) \times \mathfrak{X}(\M) \to \mathfrak{X}(\M) \\
&R(X, Y)Z = \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X, Y]} Z \, ,
\end{align} %]]></script>
<p>which is a multilinear map over $C^\infty (\M)$, and is therefore a $(1, 3)$-tensor field on $\M$.</p>
<p>We can then define a covariant 4-tensor called the <strong><em>(Riemann) curvature tensor</em></strong> to be the $(0, 4)$-tensor field $Rm := R^\flat$, by lowering the contravariant index of $R$. Its action on vector fields is given by</p>
<script type="math/tex; mode=display">Rm(X, Y, Z, W) := \inner{R(X, Y)Z, W}_g \, .</script>
<p>In any local coordinates, it is written</p>
<script type="math/tex; mode=display">Rm = R_{ijkl} \, dx^i \otimes dx^j \otimes dx^k \otimes dx^l \, ,</script>
<p>where $R_{ijkl} = g_{lm} \, {R_{ijk}}^m$. We can show that $Rm$ is a local isometry invariant. Furthermore, compatible with our intuition of the role of the curvature tensor, a Riemannian manifold is flat if and only if its curvature tensor vanishes identically.</p>
<p>Working with $4$-tensors is complicated, thus we want to construct simpler tensors that summarize some of the information contained in the curvature tensor. For that, we first need to define the trace operator for tensors. Letting $T^{(k,l)}(V)$ denote the space of tensors on a vector space $V$ with $k$ covariant and $l$ contravariant components, the trace operator is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
&\text{tr}: T^{(k+1, l+1)}(V) \to T^{(k,l)}(V) \\
&(\text{tr} \, F)(\omega^1, \dots, \omega^k, v_1, \dots, v_l) = \text{tr}(F(\omega^1, \dots, \omega^k, \cdot, v_1, \dots, v_l, \cdot)) \, ,
\end{align} %]]></script>
<p>where the trace operator on the right-hand side is the usual trace, as $F(\omega^1, \dots, \omega^k, \cdot, v_1, \dots, v_l, \cdot) \in T^{(1,1)}(V)$ is a $(1,1)$-tensor, which can be represented by a matrix. We can extend this operator to covariant tensors on Riemannian manifolds: If $h$ is any covariant $k$-tensor field with $k \geq 2$, we can raise one of its indices and obtain a $(1, k-1)$-tensor $h^\sharp$. The trace of $h^\sharp$ is thus a covariant $(k-2)$-tensor field. All in all, we define the <strong><em>trace</em></strong> of $h$ w.r.t. $g$ as</p>
<script type="math/tex; mode=display">\text{tr}_g \, h := \text{tr}(h^\sharp) \, .</script>
<p>In coordinates, it is</p>
<script type="math/tex; mode=display">\text{tr}_g \, h = {h_i}^i = g^{ij} h_{ij} \, ,</script>
<p>which, in an orthonormal frame, is given by the ordinary trace of the matrix $(h_{ij})$.</p>
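<p>In coordinates, $\text{tr}_g \, h = g^{ij} h_{ij}$ is a single tensor contraction, which is one line of NumPy. A toy illustration with made-up matrices for $g$ and $h$:</p>

```python
import numpy as np

def trace_g(G, h):
    """Trace of a covariant 2-tensor h w.r.t. the metric G: tr_g h = g^{ij} h_{ij}."""
    return np.einsum('ij,ij->', np.linalg.inv(G), h)

# With the Euclidean metric (identity matrix), tr_g h is the ordinary matrix trace.
h = np.array([[1.0, 2.0], [3.0, 4.0]])
print(trace_g(np.eye(2), h))              # 1 + 4 = 5.0

# With a non-trivial metric the components are reweighted by g^{ij}.
G = np.diag([2.0, 0.5])
print(trace_g(G, h))                      # 0.5*1 + 2*4 = 8.5
```

<p>This is exactly the contraction used below to pass from the Ricci tensor to the scalar curvature.</p>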
<p>We now define the <strong><em>Ricci curvature</em></strong> or <strong><em>Ricci tensor</em></strong> $Rc$ which is the covariant 2-tensor field defined as follows:</p>
<script type="math/tex; mode=display">Rc(X, Y) := \text{tr}(Z \mapsto R(Z, X)Y) \, ,</script>
<p>for any vector fields $X, Y$. In local coordinates, its components are</p>
<script type="math/tex; mode=display">R_{ij} := {R_{kij}}^k = g^{km} \, R_{kijm} \, .</script>
<p>We can simplify it further: We define the <strong><em>scalar curvature</em></strong> $S$ to be the trace of the Ricci tensor:</p>
<script type="math/tex; mode=display">S := \text{tr}_g \, Rc = {R_i}^i = g^{ij} \, R_{ij} \, .</script>
<p>Thus the scalar curvature is a scalar field on $\M$.</p>
<h2 class="section-heading">Submanifolds</h2>
<p>Let $\M$ be a smooth manifold. An <strong><em>embedded or regular submanifold</em></strong> of $\M$ is a subset $\mathcal{S} \subset \M$ that is a manifold in the subspace topology, endowed with a smooth structure w.r.t. which the inclusion map $\mathcal{S} \hookrightarrow \M$ is a smooth embedding. We call the difference $\text{dim} \, \M - \text{dim} \, \mathcal{S}$ the <strong><em>codimension</em></strong> of $\mathcal{S}$ in $\M$, and $\M$ the <strong><em>ambient manifold</em></strong>. An <strong><em>embedded hypersurface</em></strong> is an embedded submanifold of codimension 1.</p>
<p><strong>Example 13 (Graphs as submanifolds).</strong> Suppose $\M$ is a smooth $m$-manifold, $\mathcal{N}$ is a smooth $n$-manifold, $U \subset \M$ is open, and $f: U \to \mathcal{N}$ is a smooth map. Let $\Gamma(f) \subseteq \M \times \mathcal{N}$ denote the graph of $f$, i.e.</p>
<script type="math/tex; mode=display">\Gamma(f) := \{ (x, y) \in \M \times \mathcal{N} : x \in U, y = f(x) \} \, .</script>
<p>Then $\Gamma(f)$ is an embedded $m$-submanifold of $\M \times \mathcal{N}$.</p>
<p>Furthermore, if $f: \M \to \mathcal{N}$ is a smooth map (notice that we have defined $f$ globally here), then $\Gamma(f)$ is <strong><em>properly embedded</em></strong> in $\M \times \mathcal{N}$, i.e. the inclusion map is a <a href="https://en.wikipedia.org/wiki/Proper_map">proper map</a>.</p>
<p class="right">//</p>
<p>Suppose $\M$ and $\N$ are smooth manifolds. Let $F: \M \to \N$ be a smooth map and $p \in \M$. We define the <strong><em>rank</em></strong> of $F$ at $p$ to be the rank of the linear map $dF_p: T_p\M \to T_{F(p)}\N$, i.e. the rank of the Jacobian matrix of $F$ in coordinates. If $F$ has the same rank $r$ at every point, we say that it has <strong><em>constant rank</em></strong>, written $\rank{F} = r$. Note that the rank at $p$ is bounded by $\min \{ \dim{\M}, \dim{\N} \}$ and if it is equal to this bound, we say $F$ has <strong><em>full rank</em></strong> at $p$.</p>
<p>A smooth map $F: \M \to \N$ is called a <strong><em>smooth submersion</em></strong> if $dF$ is surjective at each point ($\rank{F} = \dim{\N}$). It is called a <strong><em>smooth immersion</em></strong> if $dF$ is injective at each point ($\rank{F} = \dim{\M}$).</p>
<p><strong>Example 14 (Submersions and immersions).</strong></p>
<ol>
<li>Suppose $\M_1, \dots, \M_k$ are smooth manifolds. Then each projection map $\pi_i: \M_1 \times \dots \times \M_k \to \M_i$ is a smooth submersion. In particular $\pi: \R^{n+k} \to \R^n$ is a smooth submersion.</li>
<li>If $\gamma: I \to \M$ is a smooth curve in a smooth manifold $\M$, then $\gamma$ is a smooth immersion if and only if $\gamma’(t) \neq 0$ for all $t \in I$.</li>
</ol>
<p class="right">//</p>
<p>Let $\M$ and $\N$ be smooth manifolds. A <strong><em>diffeomorphism</em></strong> from $\M$ to $\N$ is a smooth bijective map $F: \M \to \N$ that has a smooth inverse; $\M$ and $\N$ are then said to be <strong><em>diffeomorphic</em></strong>. $F$ is called a <strong><em>local diffeomorphism</em></strong> if every point $p \in \M$ has a neighborhood $U$ such that $F(U)$ is open in $\N$ and $F\vert_U: U \to F(U)$ is a diffeomorphism. We can show that $F$ is a local diffeomorphism if and only if it is both a smooth immersion and a smooth submersion. Furthermore, if $\dim{\M} = \dim{\N}$ and $F$ is either a smooth immersion or a smooth submersion, then it is a local diffeomorphism.</p>
<p>The <em>Global rank theorem</em> says that if $\M$ and $\N$ are smooth manifolds and $F: \M \to \N$ is a smooth map of constant rank, then it is (a) a smooth submersion if it is surjective, (b) a smooth immersion if it is injective, and (c) a diffeomorphism if it is bijective.</p>
<p>If $\M$ and $\N$ are smooth manifolds, a <strong><em>smooth embedding</em></strong> of $\M$ into $\N$ is a smooth immersion $F: \M \to \N$ that is also a topological embedding (homeomorphism onto its image in the subspace topology).</p>
<p><strong>Example 15 (Smooth embeddings).</strong> If $\M$ is a smooth manifold and $U \subseteq \M$ is an open submanifold, the inclusion $U \hookrightarrow \M$ is a smooth embedding.</p>
<p class="right">//</p>
<p>Let $F: \M \to \N$ be an injective smooth immersion. If any of the following conditions holds, then $F$ is a smooth embedding: (a) $F$ is an open or closed map, (b) $F$ is a proper map, (c) $\M$ is compact, or (d) $\M$ has empty boundary and $\dim{\M} = \dim{\N}$.</p>
<h2 class="section-heading">The second fundamental form</h2>
<p>Let $(\M, g)$ be a Riemannian submanifold of a Riemannian manifold $(\tilde{\M}, \tilde{g})$. Then, $g$ is the induced metric $g = \iota_\M^* \tilde{g}$, where $\iota_\M: \M \hookrightarrow \tilde{\M}$ is the inclusion map. Note that, the expression $\iota^*_\M \tilde{g}$ is called the <strong><em>pullback metric</em></strong> or the <strong><em>induced metric</em></strong> of $\tilde{g}$ by $\iota_\M$ and is defined by</p>
<script type="math/tex; mode=display">\iota_\M^* \tilde{g}(u, v) := \tilde{g}(d\iota_\M(u), d\iota_\M(v)) \, ,</script>
<p>for any $u, v \in T_p \M$. Also, recall that $d\iota_\M$ is the pushforward (tangent map) by $\iota_\M$. Intuitively, we map the tangent vectors $u, v$ of $T_p \M$ to some tangent vectors of $T_{\iota_\M(p)} \tilde{\M}$ and use $\tilde{g}$ as the metric.</p>
<p>In this section, we will denote any geometric object of the ambient manifold with tilde, e.g. $\tilde{\nabla}, \tilde{Rm}$, etc. Note also that, we can use the inner product notation $\inner{u, v}$ to refer to $g$ or $\tilde{g}$, since $g$ is just the restriction of $\tilde{g}$ to pairs of tangent vectors in $T \M$.</p>
<p>We would like to compare the Levi-Civita connection of $\M$ with that of $\tilde{\M}$. First, we define orthogonal projection maps, called <strong><em>tangential</em></strong> and <strong><em>normal projections</em></strong> by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\pi^\top &: T \tilde{\M} \vert_\M \to T\M \\
\pi^\perp &: T \tilde{\M} \vert_\M \to N\M \, ,
\end{align} %]]></script>
<p>where $N\M$ is the <strong><em>normal bundle</em></strong> of $\M$, i.e. the set of all vectors normal to $\M$. If $X$ is a section of $T\tilde{\M}\vert_\M$, we use the shorthand notations $X^\top = \pi^\top X$ and $X^\perp = \pi^\perp X$.</p>
<p>Given $X, Y \in \mathfrak{X}(\M)$, we can extend them to vector fields on an open subset of $\tilde{\M}$, apply the covariant derivative $\tilde{\nabla}$, and then decompose at $p \in \M$ to get</p>
<script type="math/tex; mode=display">\tilde{\nabla}_X Y = (\tilde{\nabla}_X Y)^\top + (\tilde{\nabla}_X Y)^\perp \, .</script>
<p>Let $\Gamma(E)$ be the space of smooth sections of bundle $E$. For the second part, we define the <strong><em>second fundamental form</em></strong> of $\M$ to be a map $\two: \mathfrak{X}(\M) \times \mathfrak{X}(\M) \to \Gamma(N\M)$ defined by</p>
<script type="math/tex; mode=display">\two(X, Y) = (\tilde{\nabla}_X Y)^\perp \, .</script>
<p>Meanwhile, we can show that the first part is the covariant derivative w.r.t. the Levi-Civita connection of the induced metric on $\M$. All in all, the above equation can be written as the <strong><em>Gauss formula</em></strong>:</p>
<script type="math/tex; mode=display">\tilde{\nabla}_X Y = \nabla_X Y + \two(X, Y) \, .</script>
<p>The second fundamental form can also be used to evaluate extrinsic covariant derivatives of <em>normal</em> vector fields (instead of <em>tangent</em> ones above). For each normal vector field $N \in \Gamma(N\M)$, we define a scalar-valued symmetric bilinear form $\two_N: \mathfrak{X}(\M) \times \mathfrak{X}(\M) \to \R$ by</p>
<script type="math/tex; mode=display">\two_N(X, Y) = \inner{N, \two(X, Y)} \, .</script>
<p>Let $W_N: \mathfrak{X}(\M) \to \mathfrak{X}(\M)$ denote the self-adjoint linear map associated with this bilinear form, characterized by</p>
<script type="math/tex; mode=display">\inner{W_N(X), Y} = \two_N(X, Y) = \inner{N, \two(X, Y)} \, .</script>
<p>The map $W_N$ is called the <strong><em>Weingarten map</em></strong> in the direction of $N$. Furthermore we can show that the equation $(\tilde{\nabla}_X N)^\top = -W_N(X)$ holds and is called the <strong><em>Weingarten equation</em></strong>.</p>
<p>In addition to describing the difference between the intrinsic and extrinsic connections, the second fundamental form describes the difference between the curvature tensors of $\tilde{\M}$ and $\M$. The explicit formula is called the <strong><em>Gauss equation</em></strong> and is given by</p>
<script type="math/tex; mode=display">\tilde{Rm}(W, X, Y, Z) = Rm(W, X, Y, Z) - \inner{\two(W, Z), \two(X, Y)} + \inner{\two(W, Y), \two(X, Z)} \, .</script>
<p>To give a geometric interpretation of the second fundamental form, we study the curvatures of curves. Let $\gamma: I \to \M$ be a smooth unit-speed curve. We define the <strong><em>curvature</em></strong> of $\gamma$ as the length of the acceleration vector field, i.e. the function $\kappa: I \to \R$ given by $\kappa(t) := \norm{D_t \gamma’(t)}$. We can see this curvature of the curve as a quantitative measure of how far the curve deviates from being a geodesic. Note that, if $\M = \R^n$ the curvature agrees with the one defined in calculus.</p>
<p>Now, suppose that $\M$ is a submanifold of the ambient manifold $\tilde{\M}$. Every regular curve $\gamma: I \to \M$ has two distinct curvatures: its <strong><em>intrinsic curvature</em></strong> $\kappa$ as a curve in $\M$ and its <strong><em>extrinsic curvature</em></strong> $\tilde{\kappa}$ as a curve in $\tilde{\M}$. The second fundamental form can then be used to compute the relationship between the two: For $p \in \M$ and $v \in T_p \M$, (i) $\two(v, v)$ is the $\tilde{g}$-acceleration at $p$ of the $g$-geodesic $\gamma_v$, and (ii) if $v$ is a unit vector, then $\norm{\two(v, v)}$ is the $\tilde{g}$-curvature of $\gamma_v$ at $p$.</p>
<p>The intrinsic and extrinsic accelerations of a curve are usually different. A Riemannian submanifold $(\M, g)$ of $(\tilde{\M}, \tilde{g})$ is said to be <strong><em>totally geodesic</em></strong> if every $\tilde{g}$-geodesic that is tangent to $\M$ at some time $t_0$ stays in $\M$ for all $t \in (t_0 - \epsilon, t_0 + \epsilon)$.</p>
<h2 class="section-heading">Riemannian hypersurfaces</h2>
<p>We focus on the case when $(\M, g)$ is an embedded $n$-dimensional Riemannian submanifold of an $(n+1)$-dimensional Riemannian manifold $(\tilde{\M}, \tilde{g})$. That is, $\M$ is a hypersurface of $\tilde{\M}$.</p>
<p>In this situation, at each point of $\M$, there are exactly two unit normal vectors. We choose one of these normal vector fields and call it $N$. We can replace the vector-valued second fundamental form above by a simpler scalar-valued form. The <strong><em>scalar second fundamental form</em></strong> of $\M$ is the symmetric covariant $2$-tensor field $h = \two_N$, i.e.</p>
<script type="math/tex; mode=display">h(X, Y) := \inner{N, \two(X, Y)} \enspace \enspace \enspace \text{for all } X, Y \in \mathfrak{X}(\M) \, .</script>
<p>By the Gauss formula $\tilde{\nabla}_X Y = \nabla_X Y + \two(X, Y)$ and noting that $\nabla_X Y$ is orthogonal to $N$, we can rewrite the definition as $h(X, Y) = \inner{N, \tilde{\nabla}_X Y}$. Furthermore, since $N$ is a unit vector spanning $N\M$, we can write $\two(X, Y) = h(X, Y)N$. Note that the sign of $h$ depends on the normal vector field chosen.</p>
<p>The choice of $N$ also determines a Weingarten map $W_N: \mathfrak{X}(\M) \to \mathfrak{X}(\M)$. In this special case of a hypersurface, we use the notation $s = W_N$ and call it the <strong><em>shape operator</em></strong> of $\M$. We can think of $s$ as the $(1, 1)$-tensor field on $\M$ obtained from $h$ by raising an index. It is characterized by</p>
<script type="math/tex; mode=display">\inner{sX, Y} = h(X, Y) \enspace \enspace \enspace \text{for all } X, Y \in \mathfrak{X}(\M) \, .</script>
<p>As with $h$, the choice of $N$ determines the sign of $s$.</p>
<p>Note that at every $p \in \M$, $s$ is a self-adjoint linear endomorphism of the tangent space $T_p \M$. From linear algebra, we know that there is a unit vector $v_0 \in T_p \M$ at which the function $v \mapsto \inner{sv, v}$ achieves its maximum among all unit vectors. Every such vector is an eigenvector of $s$ with eigenvalue $\lambda_0 = \inner{s v_0, v_0}$. Furthermore, $T_p \M$ has an orthonormal basis $(b_1, \dots, b_n)$ formed by eigenvectors of $s$, and all of the eigenvalues $(\kappa_1, \dots, \kappa_n)$ are real. (Note that this means $s b_i = \kappa_i b_i$ for each $i$.) In this basis, both $h$ and $s$ are represented by diagonal matrices.</p>
<p>The eigenvalues of $s$ at $p \in \M$ are called the <strong><em>principal curvatures</em></strong> of $\M$ at $p$, and the corresponding eigenvectors are called the <strong><em>principal directions</em></strong>. Note that the sign of the principal curvatures depends on the choice of $N$; otherwise, both the principal curvatures and directions are independent of the choice of coordinates.</p>
<p>From the principal curvatures, we can compute other quantities: The <strong><em>Gaussian curvature</em></strong> which is defined as $K := \text{det}(s)$ and the <strong><em>mean curvature</em></strong> $H := (1/n) \text{tr}(s)$. In other words, $K = \prod_i \kappa_i$ and $H = (1/n) \sum_i \kappa_i$, since $s$ can be represented by a symmetric matrix.</p>
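<p>The relations $K = \prod_i \kappa_i$ and $H = (1/n) \sum_i \kappa_i$ are easy to sanity-check numerically by diagonalizing a symmetric matrix standing in for the shape operator at a point. A numpy sketch (the matrix entries are made-up illustrative values, not from any particular surface):</p>

```python
import numpy as np

# The shape operator at a point, in an orthonormal basis of T_p M, is a
# symmetric matrix; its eigenvalues are the principal curvatures.
s = np.array([[2.0, 0.5],
              [0.5, 1.0]])
kappa = np.linalg.eigvalsh(s)          # principal curvatures kappa_1, kappa_2
K = np.linalg.det(s)                   # Gaussian curvature K = det(s)
H = np.trace(s) / s.shape[0]           # mean curvature H = tr(s)/n
assert np.isclose(K, np.prod(kappa))   # K = product of principal curvatures
assert np.isclose(H, np.mean(kappa))   # H = mean of principal curvatures
```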
<p>The Gaussian curvature, which is a local isometric invariant, is connected to a global topological invariant, the <a href="https://en.wikipedia.org/wiki/Euler_characteristic">Euler characteristic</a>, through the <strong><em>Gauss-Bonnet theorem</em></strong>: If $(\M, g)$ is a smoothly triangulated compact Riemannian 2-manifold, then</p>
<script type="math/tex; mode=display">\int_\M K \, dA = 2 \pi \, \chi(\M) \, ,</script>
<p>where $dA$ is its Riemannian density.</p>
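<p>For the round sphere this can be checked by direct numerical integration: $K = 1/R^2$ and $dA = R^2 \sin\theta \, d\theta \, d\phi$, so the integral should come out to $2\pi \, \chi(\mathbb{S}^2) = 4\pi$. A midpoint-rule sketch ($R$ and the grid size are illustrative choices):</p>

```python
import numpy as np

# Numerical check of Gauss-Bonnet on the round sphere S^2(R): K = 1/R^2 and
# dA = R^2 sin(theta) dtheta dphi, so the integral must equal 2*pi*chi = 4*pi,
# since chi(S^2) = 2.
R, n = 3.0, 1000
dtheta = np.pi / n
theta = (np.arange(n) + 0.5) * dtheta       # midpoint rule in theta
K = 1.0 / R**2
# integrate K dA over theta in [0, pi], phi in [0, 2*pi]
total = K * R**2 * np.sin(theta).sum() * dtheta * 2 * np.pi
assert np.isclose(total, 4 * np.pi)
```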
<h2 class="section-heading">Hypersurfaces of Euclidean space</h2>
<p>Assume that $\M \subseteq \R^{n+1}$ is an embedded Riemannian $n$-submanifold (with the induced metric from the Euclidean metric). We denote geometric objects on $\R^{n+1}$ with bar, e.g. $\bar{g}$, $\overline{Rm}$, etc. Observe that $\overline{Rm} \equiv 0$, which implies that the Riemann curvature tensor of a hypersurface in $\R^{n+1}$ is completely determined by the second fundamental form.</p>
<p>In this setting we can give some very concrete geometric interpretations of quantities on hypersurfaces. First, consider curves. For every $v \in T_p \M$, let $\gamma = \gamma_v : I \to \M$ be the $g$-geodesic in $\M$ with initial velocity $v$. The Gauss formula shows that the Euclidean acceleration of $\gamma$ at $0$ is $\gamma^{\prime\prime}(0) = \overline{D}_t \gamma’(0) = h(v, v)N_p$, thus $\abs{h(v, v)}$ is the Euclidean curvature of $\gamma$ at $0$. Furthermore, $h(v,v) = \inner{\gamma^{\prime\prime}(0), N_p} > 0$ iff. $\gamma^{\prime\prime}(0)$ points in the same direction as $N_p$. That is, $h(v, v)$ is positive if $\gamma$ is curving toward $N_p$ and negative if it is curving away from $N_p$.</p>
<p>We can show that the above Euclidean curvature can be interpreted in terms of the radius of the “best circular approximation”, just as in calculus. Suppose $\gamma: I \to \R^m$ is a unit-speed curve, $t_0 \in I$, and $\kappa(t_0) \neq 0$. We define the <strong><em>osculating circle</em></strong> at $\gamma(t_0)$ as the unique unit-speed parametrized circle $c: \R \to \R^m$ with the property that $c$ and $\gamma$ have the same position, velocity, and acceleration at $t = t_0$. Then, the Euclidean curvature of $\gamma$ at $t_0$ is $\kappa(t_0) = 1/R$, where $R$ is the radius of the osculating circle.</p>
<p>As mentioned before, to compute the curvature of a hypersurface in Euclidean space, we can compute the second fundamental form. Suppose $X: U \to \M$ is a smooth local parametrization of $\M$, $(X_1, \dots, X_n)$ is the local frame for $T \M$ determined by $X$, and $N$ is a unit normal field on $\M$. Then, the scalar second fundamental form is given by</p>
<script type="math/tex; mode=display">h(X_i, X_j) = \innerbig{\frac{\partial^2 X}{\partial u^i \partial u^j}, N} \, .</script>
<p>The implication of this is that it shows how the principal curvatures give a concise description of the local shape of the hypersurface by approximating the surface with the graph of a quadratic function. That is, we can show that there is an isometry $\phi: \R^{n+1} \to \R^{n+1}$ that takes $p \in \M$ to the origin and takes a neighborhood of it to a graph of the form $x^{n+1} = f(x^1, \dots, x^n)$, where</p>
<script type="math/tex; mode=display">f(x) = \frac{1}{2} \sum_{i=1}^n\kappa_i (x^i)^2 + O(\abs{x}^3) \, .</script>
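<p>The second-derivative formula for $h$ above can be carried out symbolically. For the standard spherical parametrization of $\mathbb{S}^2(R)$, the shape operator $s = g^{-1} h$ (with $g_{ij} = \inner{X_i, X_j}$ the first fundamental form) comes out to $-(1/R)\mathbf{I}$ for the outward normal, in agreement with Example 16 below. A sympy sketch, valid on the chart $0 < u < \pi$:</p>

```python
import sympy as sp

# Scalar second fundamental form of S^2(R) via h_ij = <d^2 X/du^i du^j, N>,
# using the standard spherical parametrization (valid on 0 < u < pi).
u, v, R = sp.symbols("u v R", positive=True)
X = R * sp.Matrix([sp.sin(u) * sp.cos(v), sp.sin(u) * sp.sin(v), sp.cos(u)])
Xu, Xv = X.diff(u), X.diff(v)
# |Xu x Xv| = R^2 sin(u) on this chart, giving the outward unit normal
N = Xu.cross(Xv) / (R**2 * sp.sin(u))
g = sp.simplify(sp.Matrix([[Xu.dot(Xu), Xu.dot(Xv)],
                           [Xv.dot(Xu), Xv.dot(Xv)]]))  # first fundamental form
h = sp.simplify(sp.Matrix([[X.diff(u, 2).dot(N), X.diff(u).diff(v).dot(N)],
                           [X.diff(v).diff(u).dot(N), X.diff(v, 2).dot(N)]]))
s = sp.simplify(g.inv() * h)        # shape operator: h with one index raised
assert sp.simplify(s + sp.eye(2) / R) == sp.zeros(2, 2)  # s = -(1/R) I
```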
<p>We can write down a smooth vector field $N = N^i \partial_i$ on an open subset of $\R^{n+1}$ that restricts to a unit normal vector field along $\M$. Then, the shape operator can be computed straightforwardly using the Weingarten equation and observing that the Euclidean covariant derivatives of $N$ are just ordinary directional derivatives in Euclidean space. Thus, for every vector $X = X^j \partial_j$ tangent to $\M$, we have</p>
<script type="math/tex; mode=display">sX = -\bar{\nabla}_X N = -\sum_{i,j=1}^{n+1} X^j (\partial_j N^i) \partial_i \, .</script>
<p>One common way to get such a smooth vector field is to work with a local defining function $F$ for $\M$, i.e. a smooth scalar field defined on some open subset $U \subseteq \R^{n+1}$ s.t. $U \cap \M$ is a regular level set of $F$. Then, we can take</p>
<script type="math/tex; mode=display">N = \frac{\grad{F}}{\norm{\grad{F}}} \, .</script>
<p>because we know that the gradient is always normal to the level set.</p>
<p><strong>Example 16 (Shape operators of spheres).</strong> The function $F: \R^{n+1} \to \R$ with $F(x) := \norm{x}^2$ is a smooth defining function for each sphere $\mathbb{S}^{n}(R)$ centered at the origin. Thus, the normalized gradient vector field</p>
<script type="math/tex; mode=display">N = \frac{1}{R} \sum_{i=1}^{n+1} x^i \partial_i</script>
<p>is a (outward pointing) unit normal vector field along $\mathbb{S}^n(R)$. The shape operator is</p>
<script type="math/tex; mode=display">sX = -\frac{1}{R} \sum_{i,j=1}^{n+1} X^j (\partial_j x^i) \partial_i = -\frac{1}{R} X \, ,</script>
<p>where recall that $\partial_j x^i = \partial x^i / \partial x^j = \delta_{ij}$. We can therefore write $s$ as a matrix $s = (-1/R) \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. The principal curvatures are then all equal to $-1/R$, the mean curvature is $H = -1/R$, and the Gaussian curvature is $K = (-1/R)^n$. Note that these curvatures are constant; this reflects the fact that the sphere bends in exactly the same way at every point.</p>
<p class="right">//</p>
<p>Lastly, for surfaces in $\R^3$, given a parametrization $X$, the normal vector field can be computed via the cross product:</p>
<script type="math/tex; mode=display">N = \frac{X_1 \times X_2}{\norm{X_1 \times X_2}} \, ,</script>
<p>where $X_1 := \partial_1 X$ and $X_2 := \partial_2 X$, which together form a basis of the tangent space at each point on the surface.</p>
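<p>As a concrete instance of this cross-product formula, we can compute the unit normal of a torus symbolically and verify that it has unit length and is orthogonal to $X_1$ and $X_2$. A sympy sketch (the torus radii and the doughnut parametrization are illustrative choices, not from the text):</p>

```python
import sympy as sp

# Unit normal of a torus via N = (X1 x X2)/|X1 x X2|, radii (a, b) = (2, 1).
u, v = sp.symbols("u v", real=True)
a, b = 2, 1
X = sp.Matrix([(a + b * sp.cos(v)) * sp.cos(u),
               (a + b * sp.cos(v)) * sp.sin(u),
               b * sp.sin(v)])
X1, X2 = X.diff(u), X.diff(v)
n = X1.cross(X2)
N = n / sp.sqrt(sp.simplify(n.dot(n)))
# N is a unit field orthogonal to the tangent frame (X1, X2)
assert sp.simplify(N.dot(N)) == 1
assert sp.simplify(N.dot(X1)) == 0
assert sp.simplify(N.dot(X2)) == 0
```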
<p>Although the Gaussian curvature is defined in terms of a particular embedding of a submanifold in the Euclidean space (i.e. it is an extrinsic quantity), it is actually an intrinsic invariant of the submanifold. Gauss showed in his <strong><em>Theorema Egregium</em></strong> that in an embedded $2$-dimensional Riemannian submanifold $(\M, g)$ of $\R^3$, for every point $p \in \M$, the Gaussian curvature of $\M$ at $p$ is equal to one-half the scalar curvature of $g$ at $p$, and thus it is a local isometry invariant of $(\M, g)$.</p>
<p>Suppose $\M$ is a Riemannian $n$-manifold with $n \geq 2$, $p \in \M$, and $V \subset T_p \M$ is a <a href="https://en.wikipedia.org/wiki/Star_domain">star-shaped neighborhood</a> of zero on which $\text{exp}_p$ is a diffeomorphism onto an open set $U \subset \M$. Let $\Pi$ be any $2$-dimensional linear subspace of $T_p \M$. Since $\Pi \cap V$ is an embedded $2$-dim submanifold of $V$, it follows that $\mathcal{S}_\Pi = \text{exp}_p(\Pi \cap V)$ is an embedded $2$-dim submanifold of $U \subset \M$ containing $p$, called the <strong><em>plane section</em></strong> determined by $\Pi$. We define the <strong><em>sectional curvature</em></strong> of $\Pi$, denoted by $\text{sec}(\Pi)$, to be the intrinsic Gaussian curvature at $p$ of the surface $\mathcal{S}_\Pi$ with the metric induced from the embedding $\mathcal{S}_\Pi \subseteq \M$. If $v, w \in T_p \M$ are linearly independent vectors, the sectional curvature is given by the formula</p>
<script type="math/tex; mode=display">\text{sec}(v, w) := \frac{Rm_p(v, w, w, v)}{\norm{v \wedge w}^2} \, ,</script>
<p>where</p>
<script type="math/tex; mode=display">\norm{v \wedge w} := \sqrt{\norm{v}^2 \norm{w}^2 - \inner{v, w}^2} \, .</script>
<p>We can show the connection between the sectional curvature and Ricci and scalar curvatures. $Rc_p(v, v)$ is the sum of the sectional curvatures of the $2$-planes spanned by $(v, b_2), \dots, (v, b_n)$, where $(b_1, \dots, b_n)$ is any orthonormal basis for $T_p \M$ with $b_1 = v$. Furthermore, the scalar curvature at $p$ is the sum of all sectional curvatures of the $2$-planes spanned by ordered pairs of distinct basis vectors in any orthonormal basis.</p>
<h2 class="section-heading">Lie groups</h2>
<p>A <strong><em>Lie group</em></strong> is a smooth manifold $\G$ that is also a group in the algebraic sense, with the property that the multiplication map $m: \G \times \G \to \G$ and inversion map $i: \G \to \G$, given by</p>
<script type="math/tex; mode=display">m(g, h) := gh \, , \qquad i(g) := g^{-1} \, ,</script>
<p>are both smooth for arbitrary $g, h \in \G$. We denote the identity element of $\G$ by $e$.</p>
<p><strong>Example 17 (Lie groups).</strong> The following manifolds are Lie groups.</p>
<ol>
<li>
<p>The <strong><em>general linear group</em></strong> $\GL(n, \R)$ is the set of invertible $n \times n$ matrices with real elements. It is a group under matrix multiplication and it is a submanifold of the vector space $\text{M}(n, \R)$, the space of $n \times n$ matrices.</p>
</li>
<li>
<p>The real number field $\R$ and the Euclidean space $\R^n$ are Lie groups under addition.</p>
</li>
</ol>
<p class="right">//</p>
<p>If $\G$ and $\mathcal{H}$ are Lie groups, a <strong><em>Lie group homomorphism</em></strong> from $\G$ to $\mathcal{H}$ is a smooth map $F: \G \to \mathcal{H}$ that is also a group homomorphism. If $F$ is also a diffeomorphism, then it is a <strong><em>Lie group isomorphism</em></strong>. We say that $\G$ and $\mathcal{H}$ are <strong><em>isomorphic Lie groups</em></strong>.</p>
<p>If $G$ is a group and $M$ is a set, a <strong><em>left action</em></strong> of $G$ on $M$ is a map $G \times M \to M$ defined by $(g, p) \mapsto g \cdot p$ that satisfies</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{alignat}{2}
g_1 \cdot (g_2 \cdot p) &= (g_1 g_2) \cdot p \qquad &&\text{for all } g_1, g_2 \in G, p \in M \, ; \\
e \cdot p &= p &&\text{for all } p \in M \, .
\end{alignat} %]]></script>
<p>Analogously, a <strong><em>right action</em></strong> is defined as a map $M \times G \to M$ satisfying</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{alignat}{2}
(p \cdot g_1) \cdot g_2 &= p \cdot (g_1 g_2) \qquad &&\text{for all } g_1, g_2 \in G, p \in M \, ; \\
p \cdot e &= p &&\text{for all } p \in M \, .
\end{alignat} %]]></script>
<p>If $M$ is a smooth manifold, $G$ is a Lie group, and the defining map is smooth, then the action is said to be a <strong><em>smooth action</em></strong>.</p>
<p>We can also give a name to an action, e.g. $\theta: G \times M \to M$ with $(g, p) \mapsto \theta_g (p)$. In this notation, the above conditions for the left action read</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\theta_{g_1} \circ \theta_{g_2} &= \theta_{g_1 g_2} \, , \\
\theta_e &= \Id_M \, ,
\end{align} %]]></script>
<p>while for a right action the first equation is replaced by $\theta_{g_2} \circ \theta_{g_1} = \theta_{g_1 g_2}$. For a smooth action, each map $\theta_g : M \to M$ is a diffeomorphism.</p>
<p>For each $p \in M$, the <strong><em>orbit</em></strong> of $p$, denoted by $G \cdot p$, is the set of all images of $p$ under the action by elements of $G$:</p>
<script type="math/tex; mode=display">G \cdot p := \{ g \cdot p : g \in G \} \, .</script>
<p>The <strong><em>isotropy group</em></strong> or <strong><em>stabilizer</em></strong> of $p$, denoted by $G_p$, is the set of elements of $G$ that fix $p$ (implying $G_p$ is a subgroup of $G$):</p>
<script type="math/tex; mode=display">G_p := \{ g \in G : g \cdot p = p \} \, .</script>
<p>A group action is said to be <strong><em>transitive</em></strong> if for every pair of points $p, q \in M$, there exists $g \in G$ such that $g \cdot p = q$, i.e. if the only orbit is all of $M$. The action is said to be <strong><em>free</em></strong> if the only element of $G$ that fixes any element of $M$ is the identity: $g \cdot p = p$ for some $p \in M$ implies $g = e$, i.e. if every isotropy group is trivial.</p>
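<p>These notions already make sense for finite groups acting on sets, which makes them easy to experiment with. The sketch below (a finite toy stand-in, not a Lie group example) computes orbits and stabilizers for the group of quarter-turn rotations acting on $\mathbb{Z}^2$:</p>

```python
# Orbits and stabilizers for the cyclic group C_4 of quarter-turn rotations
# acting on Z^2, represented by k in {0, 1, 2, 3} quarter turns.
def rotate(k, p):
    x, y = p
    for _ in range(k % 4):
        x, y = -y, x              # one counterclockwise quarter turn
    return (x, y)

def orbit(p):
    return {rotate(k, p) for k in range(4)}

def stabilizer(p):
    return {k for k in range(4) if rotate(k, p) == p}

assert orbit((1, 0)) == {(1, 0), (0, 1), (-1, 0), (0, -1)}
assert stabilizer((1, 0)) == {0}           # trivial stabilizer at (1, 0)
assert stabilizer((0, 0)) == {0, 1, 2, 3}  # the origin is fixed by everything
```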
<p><strong>Example 18 (Lie group actions).</strong></p>
<ol>
<li>
<p>If $\G$ is a Lie group and $\M$ is a smooth manifold, the <strong><em>trivial action</em></strong> of $\G$ on $\M$ is defined by $g \cdot p = p$ for all $g \in \G$ and $p \in \M$.</p>
</li>
<li>
<p>The <strong><em>natural action</em></strong> of $\GL(n, \R)$ on $\R^n$ is the left action given by matrix multiplication $(\b{A}, \vx) \mapsto \b{A} \vx$.</p>
</li>
</ol>
<p class="right">//</p>
<p>Let $\G$ be a Lie group, and let $\M$ and $\N$ be smooth manifolds endowed with smooth left or right $\G$-actions. A map $F: \M \to \N$ is <strong><em>equivariant</em></strong> w.r.t. the given actions if for each $g \in \G$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{alignat}{2}
F(g \cdot p) &= g \cdot F(p) \qquad &&\text{for left actions} \, , \\
F(p \cdot g) &= F(p) \cdot g &&\text{for right actions} \, .
\end{alignat} %]]></script>
<p>If $F: \M \to \N$ is a smooth map that is equivariant w.r.t. a transitive smooth $\G$-action on $\M$ and any smooth $\G$-action on $\N$, then $F$ has <strong><em>constant rank</em></strong>, meaning that its rank is the same for all $p \in \M$. Thus, if $F$ is surjective, it is a smooth submersion; if it is injective, it is a smooth immersion; and if it is bijective, it is a diffeomorphism.</p>
<p><strong>Example 19 (The orthogonal group).</strong> A real $n \times n$ matrix $\b{A}$ is said to be <strong><em>orthogonal</em></strong> if it preserves the Euclidean dot product as a linear map:</p>
<script type="math/tex; mode=display">(\b{A} \vx) \cdot (\b{A} \vy) = \vx \cdot \vy \qquad \text{for all} \, \vx, \vy \in \R^n \, .</script>
<p>The set of all orthogonal $n \times n$ matrices $\text{O}(n)$ is a subgroup of $\GL(n, \R)$, called the <strong><em>orthogonal group</em></strong> of degree $n$.</p>
<p class="right">//</p>
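<p>The defining property of $\text{O}(n)$ is easy to verify numerically: the QR factorization of a random matrix yields an orthogonal factor, which preserves dot products. A numpy sketch:</p>

```python
import numpy as np

# A random orthogonal matrix (the Q factor of a QR factorization) preserves
# the Euclidean dot product, as in the definition of O(n).
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((4, 4)))
assert np.allclose(A.T @ A, np.eye(4))      # A^T A = I characterizes O(n)
x, y = rng.standard_normal(4), rng.standard_normal(4)
assert np.isclose((A @ x) @ (A @ y), x @ y)
```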
<p>We would like to also study the theory of <strong><em>group representations</em></strong>, i.e. to ask whether all Lie groups can be realized as Lie subgroups of $\GL(n, \R)$ or $\GL(n, \C)$. If $\G$ is a Lie group, a <strong><em>representation</em></strong> of $\G$ is a Lie group homomorphism from $\G$ to $\GL(V)$ for some finite-dimensional vector space $V$. Note that $\GL(V)$ denotes the group of invertible linear transformations of $V$, which is a Lie group isomorphic to $\GL(n, \R)$ when $\dim{V} = n$. If a representation is injective, it is said to be <strong><em>faithful</em></strong>.</p>
<p>There is a close connection between representations and group actions. An action of $\G$ on $V$ is said to be a <strong><em>linear action</em></strong> if for each $g \in \G$, the map $V \to V$ defined by $x \mapsto g \cdot x$ is linear.</p>
<p><strong>Example 20 (Linear action).</strong> If $\rho: \G \to \GL(V)$ is a representation of $\G$, there is an associated smooth linear action of $\G$ on $V$ given by $g \cdot x = \rho(g) x$. Conversely, every smooth linear action of $\G$ on $V$ arises in this way from some representation.</p>
<p class="right">//</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Lee, John M. “Smooth manifolds.” Introduction to Smooth Manifolds. Springer, New York, NY, 2013. 1-31.</li>
<li>Lee, John M. Riemannian manifolds: an introduction to curvature. Vol. 176. Springer Science & Business Media, 2006.</li>
<li>Fels, Mark Eric. “An Introduction to Differential Geometry through Computation.” (2016).</li>
<li>Absil, P-A., Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.</li>
<li>Boumal, Nicolas. Optimization and estimation on manifolds. Diss. Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2014.</li>
<li>Graphics: <a href="https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane">https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane</a>.</li>
</ol>
Fri, 22 Feb 2019 12:00:00 +0100
http://wiseodd.github.io/techblog/2019/02/22/riemannian-geometry/
Minkowski's, Dirichlet's, and Two Squares Theorem<p><img src="/img/2018-07-24-minkowski-dirichlet/forest.svg" alt="Forest" height="250px" width="250px" /></p>
<p>Suppose we are standing at the origin of a bounded regular forest in \( \mathbb{R}^2 \) with a diameter of \(26\) m, in which all the trees have a diameter of \(0.16\) m. Can we see outside this forest? This problem can be solved using Minkowski’s Theorem. We will first see the theorem itself, and then how it answers this question. Furthermore, Minkowski’s Theorem can also be applied to prove two other famous results: Dirichlet’s Approximation Theorem and the Two Squares Theorem.</p>
<p><strong>Theorem 1 (Minkowski’s Theorem)</strong><br />
Let \( C \subseteq \mathbb{R}^d \) be a bounded, convex set that is symmetric around the origin. If \( \text{vol}(C) > 2^d \), then \( C \) contains at least one lattice point different from the origin.</p>
<p><em>Proof.</em> Let \( C’ := \frac{1}{2} C = \{ \frac{1}{2} c \, \vert \, c \in C \} \). We use the fact that there exists a non-zero lattice point \( v \in \mathbb{Z}^d \setminus \{ 0 \} \) such that the intersection between \( C’ \) and its translate by \( v \) is non-empty.</p>
<p>Pick an arbitrary \( x \in C’ \cap (C’ + v) \). Then \( x - v \in C’ \) by construction, and by symmetry \( v - x \in C’ \). As \( C’ \) is convex, the line segment between \( x \) and \( v - x \) is in \( C’ \). In particular, consider its midpoint: \( \frac{1}{2}x + \frac{1}{2} (v - x) = \frac{1}{2} v \in C’ \). This immediately implies that \( v \in C \) by the definition of \( C’ \), which proves the theorem.</p>
<p class="right">\( \square \)</p>
<p>The claim that there exists a non-zero lattice point \( v \in \mathbb{Z}^d \setminus \{ 0 \} \) such that \( C’ \cap (C’ + v) \neq \emptyset \) is not proven in this post. One can refer to Matoušek’s book for the proof.</p>
<p><img src="/img/2018-07-24-minkowski-dirichlet/forest_minkowski.svg" alt="Minkowsi_forest" height="250px" width="250px" /></p>
<p>Given Minkowski’s Theorem, we can now answer our original question. We model the trees as lattice points, and our visibility line becomes a visibility strip, which is \( 0.16 \) m wide and \( 26 \) m long. We note that the preconditions of Minkowski’s Theorem are satisfied by this visibility strip, which has a volume of \( 0.16 \times 26 = 4.16 > 4 = 2^d \). Therefore, there exists a lattice point other than the origin inside our visibility strip; thus our vision outside is blocked by a tree.</p>
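<p>We can make this concrete with a brute-force search: for a sample of viewing directions, look for a non-zero lattice point inside the corresponding closed strip. (The sampling grid is an illustrative choice; Minkowski’s Theorem guarantees such a point exists for every direction.)</p>

```python
import numpy as np

# Brute-force illustration of the forest problem: whatever direction we look,
# the closed visibility strip (width 0.16, length 26, centered at the origin)
# contains a non-zero lattice point, as Minkowski's Theorem guarantees.
def blocked(angle, width=0.16, length=26.0):
    t = np.array([np.cos(angle), np.sin(angle)])   # line of sight
    n = np.array([-t[1], t[0]])                    # unit normal of the strip
    for i in range(-13, 14):
        for j in range(-13, 14):
            p = np.array([i, j], float)
            if (i, j) != (0, 0) \
                    and abs(p @ n) <= width / 2 and abs(p @ t) <= length / 2:
                return True
    return False

# sample a range of viewing angles (50 samples is an arbitrary choice)
assert all(blocked(a) for a in np.linspace(0.0, np.pi / 2, 50))
```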
<p>Now we look at two theorems that can be proven using Minkowski’s Theorem. The first one is about approximation of real number with a rational.</p>
<p><strong>Theorem 2 (Dirichlet’s Approximation Theorem)</strong><br />
Let \( \alpha \in \mathbb{R} \). Then for all \( N \in \mathbb{N} \), there exists \( m \in \mathbb{Z}, n \in \mathbb{N} \) with \( n \leq N \) such that:</p>
<script type="math/tex; mode=display">\left \vert \, \alpha - \frac{m}{n} \right \vert \lt \frac{1}{nN} \enspace .</script>
<p><em>Proof.</em> Consider \( C := \{ (x, y) \in \mathbb{R}^2 \, \vert \, -N-\frac{1}{2} \leq x \leq N+\frac{1}{2}, \vert \alpha x - y \vert \lt \frac{1}{N} \} \). By inspection on the figure below, we can observe that \( C \) is convex, bounded, and symmetric around the origin.</p>
<p><img src="/img/2018-07-24-minkowski-dirichlet/dirichlet.svg" alt="Dirichlet" height="400px" width="400px" /></p>
<p>Observe also that the area of \( C \) is \( \text{vol}(C) = \frac{2}{N} (2N + 1) = 4 + \frac{2}{N} \gt 4 = 2^d \). Thus this construction satisfies Minkowski’s Theorem’s preconditions, and therefore there exists a lattice point \( (n, m) \neq (0, 0) \) in \( C \). As \( C \) is symmetric, we can always assume \( n \gt 0 \), thus \( n \in \mathbb{N} \). By the definition of \( C \), \( n \leq N+\frac{1}{2} \implies n \leq N \) as \( N \in \mathbb{N} \). Furthermore, we have \( \vert \alpha n - m \vert \lt \frac{1}{N} \). Dividing by \( n \) gives \( \left\vert \alpha - \frac{m}{n} \right\vert \lt \frac{1}{nN} \), which concludes the proof.</p>
<p class="right">\( \square \)</p>
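<p>The theorem is easy to check computationally for a given \( \alpha \) and \( N \) by scanning the denominators \( n \leq N \). A brute-force sketch (\( \alpha = \pi \) and \( N = 100 \) are illustrative choices):</p>

```python
from math import pi

# Brute-force check of Dirichlet's theorem for alpha = pi, N = 100: scan
# n = 1, ..., N for |alpha - m/n| < 1/(n*N), with m the nearest integer.
alpha, N = pi, 100
for n in range(1, N + 1):
    m = round(alpha * n)           # best integer numerator for this n
    if abs(alpha - m / n) < 1 / (n * N):
        break
assert abs(alpha - m / n) < 1 / (n * N)
assert (n, m) == (7, 22)           # the classical convergent 22/7 of pi
```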
<p>Our second application is the theorem saying that prime number \( p \equiv 1 \, (\text{mod } 4) \) can be written as a sum of two squares. For this we need the General Minkowski’s Theorem, which allows us to use arbitrary basis for our lattice.</p>
<p><strong>Theorem 3 (General Minkowski’s Theorem)</strong><br />
Let \( C \subseteq \mathbb{R}^d \) be a bounded, convex set that is symmetric around the origin, and let \( \Gamma \) be a lattice in \( \mathbb{R}^d \). If \( \text{vol}(C) > 2^d \,\text{vol}(\Gamma) = 2^d \det \Gamma \), then \( C \) contains at least one lattice point in \( \Gamma \) different from the origin.</p>
<p class="right">\( \square \)</p>
<p><strong>Theorem 4 (Two Squares Theorem)</strong><br />
Every prime number \( p \equiv 1 \, (\text{mod } 4) \) can be written as a sum of two squares, \( p = a^2 + b^2 \), where \( a, b \in \mathbb{Z} \).</p>
<p><em>Proof.</em> We need an intermediate result which will not be proven here (refer to [1] for the proof): \( -1 \) is a quadratic residue modulo \( p \), that is, there exists \( q \lt p \) such that \( q^2 \equiv -1 \, (\text{mod } p) \).</p>
<p>Fix \( q \) and take the following basis for our lattice: \( z_1 := (1, q), \, z_2 := (0, p) \). The volume of this lattice is: \( \det \Gamma = \det \begin{bmatrix} 1 & 0 \\ q & p \end{bmatrix} = p \).</p>
<p>Define a convex, symmetric, and bounded body \( C := \{ (x, y) \in \mathbb{R}^2 \, \vert \, x^2 + y^2 \lt 2p \} \), i.e. \( C \) is an open ball around the origin with radius \( \sqrt{2p} \). Note:</p>
<script type="math/tex; mode=display">\text{vol}(C) = \pi r^2 = 2 \pi p \approx 6.28p \gt 4p = 2^2 \, p = 2^d \det \Gamma \enspace ,</script>
<p>thus General Minkowski’s Theorem applies and there exists a lattice point \( (a, b) = i z_1 + j z_2 = (i, iq + jp) \neq (0, 0) \). Notice:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
a^2 + b^2 &= i^2 + i^2 q^2 + 2ijpq + j^2 p^2 \\
&\equiv i^2 + i^2q^2 \, (\text{mod } p) \\
&\equiv i^2(1+q^2) \, (\text{mod } p) \\
&\equiv i^2(1-1) \, (\text{mod } p) \\
&\equiv 0 \, (\text{mod } p) \enspace .
\end{align} %]]></script>
<p>To go from the 3rd to the 4th line, we use our very first assumption, i.e. \( q^2 \equiv -1 \, (\text{mod } p) \). Therefore \( a^2 + b^2 \) is divisible by \( p \). Also, as \( (a, b) \neq (0, 0) \) is in \( C \), by definition \( 0 \lt a^2 + b^2 \lt 2p \). Thus the only choice is \( a^2 + b^2 = p \), which proves the theorem.</p>
<p class="right">\( \square \)</p>
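<p>The decomposition itself can be found by brute force for small primes, which gives a quick check of the theorem (the prime list below is an arbitrary sample):</p>

```python
# Brute-force check of the Two Squares Theorem for primes p = 1 (mod 4).
def two_squares(p):
    # search for a, b >= 0 with a^2 + b^2 = p
    a = 0
    while a * a <= p:
        b = round((p - a * a) ** 0.5)
        if a * a + b * b == p:
            return a, b
        a += 1
    return None

for p in [5, 13, 17, 29, 37, 41, 53, 61, 73, 89, 97]:
    assert p % 4 == 1
    a, b = two_squares(p)
    assert a * a + b * b == p
```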
<h2 class="section-heading">References</h2>
<ol>
<li>Matoušek, Jiří. Lectures on discrete geometry. Vol. 212. New York: Springer, 2002.</li>
</ol>
Tue, 24 Jul 2018 20:30:00 +0200
http://wiseodd.github.io/techblog/2018/07/24/minkowski-dirichlet/
Reduced Betti number of sphere: Mayer-Vietoris Theorem<p>In the <a href="/techblog/2018/07/18/brouwers-fixed-point/">previous post</a> about Brouwer’s Fixed Point Theorem, we used two black boxes. In this post we will prove a slight variation of those black boxes. We will start with the simplest lemma first: the reduced homology of balls.</p>
<p><strong>Lemma 2 (Reduced homology of balls)</strong><br />
Given a \( d \)-ball \( \mathbb{B}^d \), then its reduced \( p \)-th homology space is trivial, i.e. \(\tilde{H}_p(\mathbb{B}^d) = 0 \), for any \( d \) and \( p \).</p>
<p><em>Proof.</em> Observe that \( \mathbb{B}^d \) is contractible, i.e. homotopy equivalent to a point. Assuming we use coefficients in \( \mathbb{Q} \), we know that the zeroth homology space of a point is \( H_0(\, \cdot \,, \mathbb{Q}) = \mathbb{Q} \), and that the homology is trivial otherwise, i.e. \( H_p (\, \cdot \,, \mathbb{Q}) = 0 \enspace \forall p \geq 1 \).</p>
<p>In reduced homology, therefore, \( \tilde{H}_0(\, \cdot \,, \mathbb{Q}) = 0 \) as well. Since homology is a homotopy invariant, the reduced homology of balls is trivial for all \( d, p \).</p>
<p class="right">\( \square \)</p>
<p><strong>Corollary 1 (Reduced Betti numbers of balls)</strong><br />
The \( p \)-th reduced Betti number of \( \mathbb{B}^d \) is zero for all \(d, p\).</p>
<p class="right">\( \square \)</p>
<p>Now, we are ready to prove the main theme of this post.</p>
<p><strong>Lemma 1 (Reduced Betti numbers of spheres)</strong> <br />
Given a \( d \)-sphere \( \mathbb{S}^d \), then its \( p \)-th reduced Betti number is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\tilde{\beta}_p(\mathbb{S}^d) = \begin{cases} 1, & \text{if } p = d \\ 0, & \text{otherwise} \enspace . \end{cases} %]]></script>
<p><em>Proof.</em> We use a “divide-and-conquer” approach to apply the Mayer-Vietoris Theorem. We cut the sphere along the equator and note that the upper and lower portions of the sphere are each just a disk (a \( d \)-ball), while the intersection of those two parts is the equator itself, a sphere one dimension down, as shown in the figure below.</p>
<p><img src="/img/2018-07-23-mayer-vietoris-sphere/sphere.svg" alt="Sphere" height="350px" width="350px" /></p>
<p>By Mayer-Vietoris Theorem, we have a long exact sequence in the form of:</p>
<script type="math/tex; mode=display">\dots \longrightarrow \tilde{H}_p(\mathbb{S}^{d-1}) \longrightarrow \tilde{H}_p(\mathbb{B}^d) \oplus \tilde{H}_p(\mathbb{B}^d) \longrightarrow \tilde{H}_p(\mathbb{S}^d) \longrightarrow \tilde{H}_{p-1}(\mathbb{S}^{d-1}) \longrightarrow \dots \enspace .</script>
<p>By Corollary 1, \( \tilde{H}_p(\mathbb{B}^d) \oplus \tilde{H}_p(\mathbb{B}^d) = \tilde{H}_{p-1}(\mathbb{B}^d) \oplus \tilde{H}_{p-1}(\mathbb{B}^d) = 0 \). As the sequence is exact, the connecting homomorphism \( \tilde{H}_p(\mathbb{S}^d) \longrightarrow \tilde{H}_{p-1}(\mathbb{S}^{d-1}) \) is therefore a bijection, and thus an isomorphism. By induction with base case \( \mathbb{S}^0 \) (two points, whose reduced homology is \( \mathbb{Q} \) in degree \( 0 \) and trivial otherwise), we conclude that the claim holds.</p>
<p class="right">\( \square \)</p>
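The induction in the proof can be mirrored in a few lines of code. The sketch below (an illustration of ours, not from the post) repeatedly applies the isomorphism \( \tilde{H}_p(\mathbb{S}^d) \cong \tilde{H}_{p-1}(\mathbb{S}^{d-1}) \) until it reaches a base case:

```python
def reduced_betti_sphere(p, d):
    # Walk down the isomorphisms H~_p(S^d) ≅ H~_{p-1}(S^{d-1}) coming
    # from the Mayer-Vietoris sequence until p or d hits zero.
    while p > 0 and d > 0:
        p, d = p - 1, d - 1
    # Base cases: H~_0(S^0) has rank 1; the reduced homology of S^0 in
    # positive degrees, and H~_0(S^d) for d >= 1, are trivial.
    return 1 if p == 0 and d == 0 else 0
```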
<h2 class="section-heading">References</h2>
<ol>
<li>Hatcher, Allen. “Algebraic topology.” (2001).</li>
</ol>
Mon, 23 Jul 2018 10:00:00 +0200
http://wiseodd.github.io/techblog/2018/07/23/mayer-vietoris-sphere/
http://wiseodd.github.io/techblog/2018/07/23/mayer-vietoris-sphere/
math, techblog

Brouwer's Fixed Point Theorem: A Proof with Reduced Homology

<p>This post is about a proof I found very interesting during the Topology course I took this semester. It highlights an application of reduced homology, which is a modification of homology theory in Algebraic Topology. We will use two results from reduced homology as black boxes for the proof. Throughout, we will assume \( \mathbb{Q} \) is used as the coefficient of the homology spaces.</p>
<p><strong>Lemma 1 (Reduced Homology of spheres)</strong>
Given a \( d \)-sphere \( \mathbb{S}^d \), then its reduced \( p \)-th Homology space is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\tilde{H}_p(\mathbb{S}^d) = \begin{cases} \mathbb{Q}, & \text{if } p = d \\ 0, & \text{otherwise} \enspace . \end{cases} %]]></script>
<p class="right">\( \square \)</p>
<p><strong>Lemma 2 (Reduced Homology of balls)</strong>
Given a \( d \)-ball \( \mathbb{B}^d \), then its reduced \( p \)-th Homology space is trivial, i.e. \(\tilde{H}_p(\mathbb{B}^d) = 0 \), for any \( d \) and \( p \).</p>
<p class="right">\( \square \)</p>
<p>Equipped with these lemmas, we are ready to prove the special case of Brouwer’s Fixed Point Theorem, where we consider map from a ball to itself.</p>
<p><strong>Brouwer’s Fixed Point Theorem</strong>
Given \( f: \mathbb{B}^{d+1} \to \mathbb{B}^{d+1} \) continuous, then there exists \( x
\in \mathbb{B}^{d+1} \) such that \( f(x) = x \).</p>
<p><em>Proof.</em> For contradiction, assume \( \forall x \in \mathbb{B}^{d+1}: f(x) \neq x \). We construct a map \( r: \mathbb{B}^{d+1} \to \mathbb{S}^d \) by casting a ray from the ball to its boundary: extend the line segment from \( f(x) \) through \( x \) until it hits \( \mathbb{S}^d \), and let \( r(x) \) be that intersection point.</p>
<p><img src="/img/2018-07-18-brouwers-fixed-point/map_r.svg" alt="Map r" height="200px" width="200px" /></p>
<p>Observe that \( r \) is continuous because \( f \) is. Also, \( x \in \mathbb{S}^d \implies r(x) = x \). Therefore we have the following commutative diagram.</p>
<p><img src="/img/2018-07-18-brouwers-fixed-point/comm_diag.svg" alt="Commutative Diagram" height="200px" width="200px" /></p>
<p>Above, \( i \) is the inclusion map, and \( id \) is the identity map. Taking the reduced homology of the diagram above gives us the following commutative diagram.</p>
<p><img src="/img/2018-07-18-brouwers-fixed-point/comm_diag_hom.svg" alt="Commutative Diagram Homology" height="275px" width="275px" /></p>
<p>As the diagram commutes, the composition \( \tilde{H}_d(\mathbb{S}^d) \xrightarrow{i^*} \tilde{H}_d(\mathbb{B}^{d+1}) \xrightarrow{r^*} \tilde{H}_d(\mathbb{S}^d) \) must be the identity map on \( \tilde{H}_d(\mathbb{S}^d) \). By Lemma 2, \( \tilde{H}_d(\mathbb{B}^{d+1}) = 0 \), which implies \( \tilde{H}_d(\mathbb{S}^d) = 0 \). But this is a contradiction, as by Lemma 1, \( \tilde{H}_d(\mathbb{S}^d) = \mathbb{Q} \). Therefore there must be a fixed point.</p>
<p class="right">\( \square \)</p>
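Brouwer’s theorem is purely existential, but we can probe it numerically. The sketch below is our own toy, not part of the proof: it grid-searches the closed unit disk for a point nearly fixed by a given continuous map. Here we try a rotation about the origin, whose unique fixed point is the origin itself:

```python
import math

def approx_fixed_point(f, n=201):
    # Scan an (n x n) grid over the unit disk and return the point
    # minimizing |f(x) - x|; Brouwer guarantees this gap can be made
    # arbitrarily small as the grid is refined.
    best, best_gap = None, math.inf
    for i in range(n):
        for j in range(n):
            x = -1 + 2 * i / (n - 1)
            y = -1 + 2 * j / (n - 1)
            if x * x + y * y <= 1:
                fx, fy = f(x, y)
                gap = math.hypot(fx - x, fy - y)
                if gap < best_gap:
                    best, best_gap = (x, y), gap
    return best, best_gap

# Rotation by 0.5 radians: continuous on the disk, fixes only the origin.
rotate = lambda x, y: (x * math.cos(0.5) - y * math.sin(0.5),
                       x * math.sin(0.5) + y * math.cos(0.5))
```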
<h2 class="section-heading">References</h2>
<ol>
<li>Hatcher, Allen. “Algebraic topology.” (2001).</li>
</ol>
Wed, 18 Jul 2018 10:00:00 +0200
http://wiseodd.github.io/techblog/2018/07/18/brouwers-fixed-point/
http://wiseodd.github.io/techblog/2018/07/18/brouwers-fixed-point/
math, techblog

Natural Gradient Descent

<p><a href="/techblog/2018/03/11/fisher-information/">Previously</a>, we looked at the Fisher Information Matrix. We saw that it is equal to the negative expected Hessian of the log likelihood. Thus, an immediate application of the Fisher Information Matrix is as a drop-in replacement for the Hessian in second order optimization algorithms. In this article, we will look deeper at the intuition of what exactly the Fisher Information Matrix represents and how to interpret it.</p>
<h2 class="section-heading">Distribution Space</h2>
<p>As in the previous article, we have a probabilistic model represented by its likelihood \( p(x \vert \theta) \). We want to maximize this likelihood function to find the most likely parameter \( \theta \). An equivalent formulation is to minimize the loss function \( \mathcal{L}(\theta) \), which is the negative log likelihood.</p>
<p>The usual way to solve this optimization is gradient descent. In this case, we take steps in the direction given by \( -\nabla_\theta \mathcal{L}(\theta) \). This is the steepest descent direction in the local neighbourhood of the current value of \( \theta \) in the parameter space. Formally, we have</p>
<script type="math/tex; mode=display">\frac{-\nabla_\theta \mathcal{L}(\theta)}{\lVert \nabla_\theta \mathcal{L}(\theta) \rVert} = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \mathop{\text{arg min}}_{d \text{ s.t. } \lVert d \rVert \leq \epsilon} \mathcal{L}(\theta + d) \, .</script>
<p>The above expression says that the steepest descent direction in parameter space is to pick a vector \( d \) such that the new parameter \( \theta + d \) is within the \( \epsilon \)-neighbourhood of the current parameter \( \theta \), and among those to pick the \( d \) that minimizes the loss. Notice that we express this neighbourhood by means of the Euclidean norm. Thus, the optimization in gradient descent depends on the Euclidean geometry of the parameter space.</p>
<p>Meanwhile, if our objective is to minimize the loss function (i.e. maximize the likelihood), then it is natural to take steps in the space of all possible likelihood functions realizable by the parameter \( \theta \). As the likelihood function itself is a probability distribution, we call this space the distribution space. Thus it makes sense to take the steepest descent direction in this distribution space instead of in the parameter space.</p>
<p>Which metric/distance do we then need to use in this space? A popular choice is the KL-divergence, which measures the “closeness” of two distributions. Although the KL-divergence is non-symmetric and thus not a true metric, we can use it anyway, because as \( d \) goes to zero the KL-divergence becomes asymptotically symmetric. So, within a local neighbourhood, the KL-divergence is approximately symmetric [1].</p>
<p>We can see the problem of using only the Euclidean metric in parameter space from the illustrations below. Consider a Gaussian parameterized only by its mean, with the variance fixed to 2 and 0.5 for the first and second image respectively:</p>
<p><img src="/img/2018-03-14-natural-gradient/param_space_dist.png" alt="Param1" /></p>
<p><img src="/img/2018-03-14-natural-gradient/param_space_dist2.png" alt="Param2" /></p>
<p>In both images, the distance between those Gaussians is the same, i.e. 4, according to the Euclidean metric (red line). However, clearly in distribution space, i.e. when we take into account the shape of the Gaussians, the distances are different. In the first image, the KL-divergence is lower, as there is more overlap between the Gaussians. Therefore, if we only work in parameter space, we cannot take this information about the distributions realized by the parameters into account.</p>
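We can make this concrete with the closed-form KL-divergence between two Gaussians (a quick check of ours, not part of the original post). With the means 4 apart, the variance-2 pair is much closer in KL than the variance-0.5 pair:

```python
import math

def kl_gaussians(m1, s1, m2, s2):
    # KL[N(m1, s1^2) || N(m2, s2^2)] in closed form.
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

# Same Euclidean distance (4) between the means in both cases:
wide = kl_gaussians(0.0, math.sqrt(2.0), 4.0, math.sqrt(2.0))    # variance 2
narrow = kl_gaussians(0.0, math.sqrt(0.5), 4.0, math.sqrt(0.5))  # variance 0.5
```

For equal variances the formula reduces to \( (m_1 - m_2)^2 / (2\sigma^2) \), giving KL values 4 and 16 respectively, even though the Euclidean distance is 4 in both cases.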
<!-- The other nice property of working in distribution space instead of parameter space is that in distribution space, it is invariant to parameterization of the distribution. As an illustration, consider a Gaussian. We can parametrize it with its covariance matrix or precision matrix. Covariance and precision matrix are different to each other (up to special condition, e.g. identity matrix), even though it induces the same Gaussian. Thus, a single point in distribution space are possibly mapped into two different points in the parameter space. If we work in distribution space, then we only care about the resulting Gaussian. -->
<h2 class="section-heading">Fisher Information and KL-divergence</h2>
<p>One question that still needs to be answered is: what exactly is the connection between the Fisher Information Matrix and the KL-divergence? It turns out that the Fisher Information Matrix defines the local curvature in distribution space, for which the KL-divergence is the metric.</p>
<p><strong>Claim:</strong>
The Fisher Information Matrix \( \text{F} \) is the Hessian of the KL-divergence between two distributions \( p(x \vert \theta) \) and \( p(x \vert \theta') \), with respect to \( \theta' \), evaluated at \( \theta' = \theta \).</p>
<p><em>Proof.</em> KL-divergence can be decomposed into entropy and cross-entropy term, i.e.:</p>
<script type="math/tex; mode=display">\text{KL} [p(x \vert \theta) \, \Vert \, p(x \vert \theta')] = \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \log p(x \vert \theta) ] - \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \log p(x \vert \theta') ] \, .</script>
<p>The first derivative wrt. \( \theta' \) is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_{\theta'} \text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta')] &= \nabla_{\theta'} \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \log p(x \vert \theta) ] - \nabla_{\theta'} \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \log p(x \vert \theta') ] \\[5pt]
&= - \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \nabla_{\theta'} \log p(x \vert \theta') ] \\[5pt]
&= - \int p(x \vert \theta) \nabla_{\theta'} \log p(x \vert \theta') \, \text{d}x \, .
\end{align} %]]></script>
<p>The second derivative is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_{\theta'}^2 \, \text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta')] &= - \int p(x \vert \theta) \, \nabla_{\theta'}^2 \log p(x \vert \theta') \, \text{d}x \\[5pt]
\end{align} %]]></script>
<p>Thus, the Hessian wrt. \( \theta' \) evaluated at \( \theta' = \theta \) is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{H}_{\text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta')]} &= - \int p(x \vert \theta) \, \left. \nabla_{\theta'}^2 \log p(x \vert \theta') \right\vert_{\theta' = \theta} \, \text{d}x \\[5pt]
&= - \int p(x \vert \theta) \, \text{H}_{\log p(x \vert \theta)} \, \text{d}x \\[5pt]
&= - \mathop{\mathbb{E}}_{p(x \vert \theta)} [\text{H}_{\log p(x \vert \theta)}] \\[5pt]
&= \text{F} \, .
\end{align} %]]></script>
<p>The last line follows from <a href="/techblog/2018/03/11/fisher-information/">the previous article about Fisher Information Matrix</a>, in which we showed that the negative expected Hessian of log likelihood is the Fisher Information Matrix.</p>
<p class="right">\( \square \)</p>
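As a sanity check of this claim (our own example, not from the original article), take a Poisson likelihood with rate \( \lambda \), whose Fisher information is \( 1/\lambda \). A finite-difference Hessian of the closed-form KL-divergence, taken with respect to the second argument at \( \lambda' = \lambda \), recovers it:

```python
import math

def kl_poisson(lam, lamp):
    # KL[Poisson(lam) || Poisson(lamp)] in closed form.
    return lamp - lam + lam * math.log(lam / lamp)

def kl_hessian_fd(lam, h=1e-4):
    # Central second difference wrt the second argument at lamp = lam;
    # by the claim above this should recover F = 1 / lam.
    return (kl_poisson(lam, lam + h)
            - 2 * kl_poisson(lam, lam)
            + kl_poisson(lam, lam - h)) / h ** 2
```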
<h2 class="section-heading">Steepest Descent in Distribution Space</h2>
<p>Now we are ready to use the Fisher Information Matrix to enhance gradient descent. But first, we need to derive the second order Taylor series expansion of the KL-divergence around \( \theta \).</p>
<p><strong>Claim:</strong>
For \( d \to 0 \), the second order Taylor series expansion of the KL-divergence is \( \text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta + d)] \approx \frac{1}{2} d^\text{T} \text{F} d \).</p>
<p><em>Proof.</em> We will use \( p_{\theta} \) as a notational shortcut for \( p(x \vert \theta) \). By definition, the second order Taylor series expansion of KL-divergence is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{KL}[p_{\theta} \, \Vert \, p_{\theta + d}] &\approx \text{KL}[p_{\theta} \, \Vert \, p_{\theta}] + (\left. \nabla_{\theta'} \text{KL}[p_{\theta} \, \Vert \, p_{\theta'}] \right\vert_{\theta' = \theta})^\text{T} d + \frac{1}{2} d^\text{T} \text{F} d \\[5pt]
&= \text{KL}[p_{\theta} \, \Vert \, p_{\theta}] - \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \nabla_\theta \log p(x \vert \theta) ]^\text{T} d + \frac{1}{2} d^\text{T} \text{F} d \\[5pt]
\end{align} %]]></script>
<p>Notice that the first term is zero, as it is the KL-divergence of a distribution with itself. Furthermore, from the <a href="/techblog/2018/03/11/fisher-information/">previous article</a>, we saw that the expected value of the gradient of the log likelihood, which is (up to sign) exactly the gradient of the KL-divergence computed in the previous proof, is also zero. Thus the only thing left is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta + d)] &\approx \frac{1}{2} d^\text{T} \text{F} d \, .
\end{align} %]]></script>
<p class="right">\( \square \)</p>
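To see the quality of this quadratic approximation, consider a Bernoulli model with parameter \( \theta \), for which \( \text{F} = 1/(\theta(1-\theta)) \) (a toy check of ours, not from the original post). The relative error of \( \frac{1}{2} d^\text{T} \text{F} d \) shrinks as \( d \to 0 \):

```python
import math

def kl_bernoulli(t, tp):
    # KL[Bern(t) || Bern(tp)] in closed form.
    return t * math.log(t / tp) + (1 - t) * math.log((1 - t) / (1 - tp))

def quad_approx(t, d):
    # (1/2) d F d with the scalar Bernoulli Fisher information F = 1/(t(1-t)).
    return 0.5 * d * d / (t * (1 - t))

def rel_error(t, d):
    # Relative error of the second order Taylor approximation of the KL.
    return abs(kl_bernoulli(t, t + d) - quad_approx(t, d)) / quad_approx(t, d)
```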
<p>Now, we would like to know which update vector \( d \) minimizes the loss function \( \mathcal{L} (\theta) \) in distribution space, i.e. which direction decreases the loss the most for a given amount of movement measured by KL-divergence. This is analogous to the method of steepest descent, but in distribution space with the KL-divergence as the metric, instead of the usual parameter space with the Euclidean metric. For that, we solve the minimization:</p>
<script type="math/tex; mode=display">d^* = \mathop{\text{arg min}}_{d \text{ s.t. } \text{KL}[p_\theta \Vert p_{\theta + d}] = c} \mathcal{L} (\theta + d) \, ,</script>
<p>where \( c \) is some constant. The purpose of fixing the KL-divergence to some constant is to make sure that we move along the space with constant speed, regardless of the curvature. A further benefit is that this makes the algorithm more robust to reparametrization of the model: the algorithm does not care how the model is parametrized, it only cares about the distribution induced by the parameter [3].</p>
<p>If we write the above minimization in Lagrangian form, approximating the constraint KL-divergence by its second order Taylor series expansion and \( \mathcal{L}(\theta + d) \) by its first order Taylor series expansion, we get:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
d^* &= \mathop{\text{arg min}}_d \, \mathcal{L} (\theta + d) + \lambda \, (\text{KL}[p_\theta \Vert p_{\theta + d}] - c) \\
&\approx \mathop{\text{arg min}}_d \, \mathcal{L}(\theta) + \nabla_\theta \mathcal{L}(\theta)^\text{T} d + \frac{1}{2} \lambda \, d^\text{T} \text{F} d - \lambda c \, .
\end{align} %]]></script>
<p>To solve this minimization, we set its derivative wrt. \( d \) to zero:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
0 &= \frac{\partial}{\partial d} \left[ \mathcal{L}(\theta) + \nabla_\theta \mathcal{L}(\theta)^\text{T} d + \frac{1}{2} \lambda \, d^\text{T} \text{F} d - \lambda c \right] \\[5pt]
&= \nabla_\theta \mathcal{L}(\theta) + \lambda \, \text{F} d \\[5pt]
\lambda \, \text{F} d &= -\nabla_\theta \mathcal{L}(\theta) \\[5pt]
d &= -\frac{1}{\lambda} \text{F}^{-1} \nabla_\theta \mathcal{L}(\theta) \\[5pt]
\end{align} %]]></script>
<p>Up to a constant factor of \( \frac{1}{\lambda} \), we get the optimal descent direction, i.e. the opposite direction of the gradient, adjusted by \( \text{F}^{-1} \) to account for the local curvature in distribution space. We can absorb this constant factor into the learning rate.</p>
<p><strong>Definition:</strong>
Natural gradient is defined as</p>
<script type="math/tex; mode=display">\tilde{\nabla}_\theta \mathcal{L}(\theta) = \text{F}^{-1} \nabla_\theta \mathcal{L}(\theta) \, .</script>
<p class="right">\( \square \)</p>
<p>As a corollary, we have the following algorithm:</p>
<p><strong>Algorithm: Natural Gradient Descent</strong></p>
<ol>
<li>Repeat:
<ol>
<li>Do forward pass on our model and compute loss \( \mathcal{L}(\theta) \).</li>
<li>Compute the gradient \( \nabla_\theta \mathcal{L}(\theta) \).</li>
<li><a href="/techblog/2018/03/11/fisher-information/">Compute the Fisher Information Matrix</a> \( \text{F} \), or its empirical version (wrt. our training data).</li>
<li>Compute the natural gradient \( \tilde{\nabla}_\theta \mathcal{L}(\theta) = \text{F}^{-1} \nabla_\theta \mathcal{L}(\theta) \).</li>
<li>Update the parameter: \( \theta = \theta - \alpha \, \tilde{\nabla}_\theta \mathcal{L}(\theta) \), where \( \alpha \) is the learning rate.</li>
</ol>
</li>
<li>Until convergence.</li>
</ol>
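The steps above can be sketched on a deliberately tiny problem (a toy of ours, not the logistic-regression example of the original post): fitting a Bernoulli parameter \( \theta \) by maximum likelihood, where \( \text{F} = 1/(\theta(1-\theta)) \) is available in closed form. With learning rate \( \alpha = 1 \), natural gradient descent jumps straight to the MLE (the empirical mean) in a single step, whereas vanilla gradient descent would need many:

```python
def natural_gd_bernoulli(m, theta=0.5, alpha=1.0, steps=5):
    # Toy natural gradient descent for the MLE of a Bernoulli parameter.
    # m is the empirical mean of the data; F = 1/(theta*(1-theta)).
    for _ in range(steps):
        grad = -(m - theta) / (theta * (1 - theta))  # gradient of the NLL
        fisher = 1.0 / (theta * (1 - theta))
        theta = theta - alpha * grad / fisher        # natural gradient step
    return theta
```

The cancellation \( \text{F}^{-1} \nabla_\theta \mathcal{L}(\theta) = -(m - \theta) \) is what makes the single-step convergence work here; in general the preconditioning only improves the direction, not solve the problem exactly.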
<!-- <h2 class="section-heading">Simple Implementation Example</h2>
**Remark.** _The implementation below is based on the empirical FIM, thus does not reflect the true natural gradient. (See <https://arxiv.org/abs/1905.12558>.) To use the true FIM, one need to take the expectation of the outer product of the gradient w.r.t. the predictive distribution. For example, one can do Monte Carlo approximation by drawing random labels to compute the loss and subsequently compute the FIM. Note that the calculation of the vanilla gradient remains unchanged._
Let's consider logistic regression problem. The training data is drawn from a mixture of Gaussians centered at \\( (-1, -1) \\) and \\( (1, 1) \\). We assign different labels for each mode. The code is as follows:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.utils</span> <span class="kn">import</span> <span class="n">shuffle</span>
<span class="n">X0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">X1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">X0</span><span class="p">,</span> <span class="n">X1</span><span class="p">])</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">([</span><span class="mi">100</span><span class="p">,</span> <span class="mi">1</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">([</span><span class="mi">100</span><span class="p">,</span> <span class="mi">1</span><span class="p">])])</span>
<span class="n">X</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="n">shuffle</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">t</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:</span><span class="mi">150</span><span class="p">],</span> <span class="n">X</span><span class="p">[</span><span class="mi">150</span><span class="p">:]</span>
<span class="n">t_train</span><span class="p">,</span> <span class="n">t_test</span> <span class="o">=</span> <span class="n">t</span><span class="p">[:</span><span class="mi">150</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">150</span><span class="p">:]</span></code></pre></figure>
Next, we consider our model. It is a simple linear model (without bias) with sigmoid output. Thus naturally, we use binary cross entropy loss:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Initialize weight
</span><span class="n">W</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.01</span>
<span class="k">def</span> <span class="nf">sigm</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="mi">1</span><span class="o">/</span><span class="p">(</span><span class="mi">1</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">NLL</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">t</span><span class="p">):</span>
<span class="k">return</span> <span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">t</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">t</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">y</span><span class="p">))</span></code></pre></figure>
Inside the training loop, the forward pass looks like:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Forward
</span><span class="n">z</span> <span class="o">=</span> <span class="n">X_train</span> <span class="o">@</span> <span class="n">W</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">sigm</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">NLL</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">t_train</span><span class="p">)</span>
<span class="c1"># Loss
</span><span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">'Loss: {loss:.3f}'</span><span class="p">)</span></code></pre></figure>
The gradient of the loss function wrt. parameter \\( w \\) is then as follows:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">dy</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">t_train</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">m</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">))</span>
<span class="n">dz</span> <span class="o">=</span> <span class="n">sigm</span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">sigm</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>
<span class="n">dW</span> <span class="o">=</span> <span class="n">X_train</span><span class="o">.</span><span class="n">T</span> <span class="o">@</span> <span class="p">(</span><span class="n">dz</span> <span class="o">*</span> <span class="n">dy</span><span class="p">)</span></code></pre></figure>
At this point we are ready to do update step for vanilla gradient descent:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">W</span> <span class="o">=</span> <span class="n">W</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">dW</span></code></pre></figure>
For natural gradient descent, we need some extra works. Firstly we need to compute the gradient of log likelihood wrt. \\( w \\), without summing, as we will do this when we compute the covariance.
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">grad_loglik_z</span> <span class="o">=</span> <span class="p">(</span><span class="n">t_train</span><span class="o">-</span><span class="n">y</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">dz</span>
<span class="n">grad_loglik_W</span> <span class="o">=</span> <span class="n">grad_loglik_z</span> <span class="o">*</span> <span class="n">X_train</span></code></pre></figure>
The Empirical Fisher is given by the empirical covariance matrix of the gradient of log likelihood wrt. our training data:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">F</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">cov</span><span class="p">(</span><span class="n">grad_loglik_W</span><span class="o">.</span><span class="n">T</span><span class="p">)</span></code></pre></figure>
To do the update step, we need to take the product of \\( \text{F}^{-1} \\) with the gradient of loss:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">W</span> <span class="o">=</span> <span class="n">W</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">inv</span><span class="p">(</span><span class="n">F</span><span class="p">)</span> <span class="o">@</span> <span class="n">dW</span></code></pre></figure>
The complete script to reproduce this can be found at:
<https://gist.github.com/wiseodd/1c9f5006310f5ee03bd4682b4c03020a>.
How good is natural gradient descent compared to the vanilla gradient descent? Below are the comparison of loss value after five iterations, averaged over 100 repetitions.
| Method | Mean loss | Std. loss |
|:---|:---:|:---:|
| Natural Gradient Descent | **0.1823** | **0.0814** |
| Vanilla Gradient Descent | 0.4058 | 0.106 |
{:.table-bordered}
At least in this very simple setting, natural gradient descent converges twice as fast as the vanilla counterpart. Furthermore, it converges faster consistently, as shown by the standard deviation. -->
<h2 class="section-heading">Discussion</h2>
<p>For simple models with small amounts of data, natural gradient descent can be implemented easily. But how easy is it to do this in the real world? As we know, the number of parameters in deep learning models is very large, often in the millions. The Fisher Information Matrix for these kinds of models is infeasible to compute, store, or invert. This is the same reason why second order optimization methods are not popular in deep learning.</p>
<p>One way to get around this problem is to approximate the Fisher/Hessian instead. A method like ADAM [4] computes the running average of the first and second moments of the gradient. The first moment can be seen as momentum, which is not our interest in this article. The second moment approximates the Fisher Information Matrix, but constrains it to be a diagonal matrix. Thus in ADAM, we only need \( O(n) \) space to store (the approximation of) \( \text{F} \) instead of \( O(n^2) \), and the inversion can be done in \( O(n) \) instead of \( O(n^3) \). In practice ADAM works really well and is currently the <em>de facto</em> standard for optimizing deep neural networks.</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Martens, James. “New insights and perspectives on the natural gradient method.” arXiv preprint arXiv:1412.1193 (2014).</li>
<li>Ly, Alexander, et al. “A tutorial on Fisher information.” Journal of Mathematical Psychology 80 (2017): 40-55.</li>
<li>Pascanu, Razvan, and Yoshua Bengio. “Revisiting natural gradient for deep networks.” arXiv preprint arXiv:1301.3584 (2013).</li>
<li>Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).</li>
</ol>
Wed, 14 Mar 2018 07:00:00 +0100
http://wiseodd.github.io/techblog/2018/03/14/natural-gradient/
http://wiseodd.github.io/techblog/2018/03/14/natural-gradient/
machine learning, techblog

Fisher Information Matrix

<p>Suppose we have a model parameterized by a parameter vector \( \theta \) that models a distribution \( p(x \vert \theta) \). In frequentist statistics, the way we learn \( \theta \) is to maximize the likelihood \( p(x \vert \theta) \) wrt. the parameter \( \theta \). To assess the goodness of our estimate of \( \theta \), we define a score function:</p>
<script type="math/tex; mode=display">s(\theta) = \nabla_{\theta} \log p(x \vert \theta) \, ,</script>
<p>that is, the score function is the gradient of the log likelihood function. The result about the score function below is an important building block for our discussion.</p>
<p><strong>Claim:</strong>
The expected value of the score wrt. our model is zero.</p>
<p><em>Proof.</em> Below, the gradient is wrt. \( \theta \).</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ s(\theta) \right] &= \mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ \nabla \log p(x \vert \theta) \right] \\[5pt]
&= \int \nabla \log p(x \vert \theta) \, p(x \vert \theta) \, \text{d}x \\[5pt]
&= \int \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)} p(x \vert \theta) \, \text{d}x \\[5pt]
&= \int \nabla p(x \vert \theta) \, \text{d}x \\[5pt]
&= \nabla \int p(x \vert \theta) \, \text{d}x \\[5pt]
&= \nabla 1 \\[5pt]
&= 0
\end{align} %]]></script>
<p class="right">\( \square \)</p>
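<p>This claim is easy to check numerically. A minimal sketch, assuming a unit-variance Gaussian \( p(x \vert \theta) = \mathcal{N}(x \vert \theta, 1) \), whose score wrt. \( \theta \) is \( x - \theta \):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5                                   # mean of a unit-variance Gaussian
x = rng.normal(theta, 1.0, size=1_000_000)    # samples from p(x | theta)

# Score of N(x | theta, 1) wrt. theta: d/dtheta log p(x | theta) = x - theta
score = x - theta
print(np.mean(score))  # close to 0
```

<p>The Monte Carlo average of the score vanishes, exactly as the proof predicts.</p>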
<p>But how certain are we of our estimate? We can define an uncertainty measure around the expected estimate. That is, we look at the covariance of the score of our model. Taking the result from above:</p>
<script type="math/tex; mode=display">\mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ (s(\theta) - 0) \, (s(\theta) - 0)^{\text{T}} \right] \, .</script>
<p>We can then interpret this covariance as information. The covariance of the score function above is the definition of the Fisher Information. As we assume \( \theta \) is a vector, the Fisher Information takes a matrix form, called the Fisher Information Matrix:</p>
<script type="math/tex; mode=display">\text{F} = \mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ \nabla \log p(x \vert \theta) \, \nabla \log p(x \vert \theta)^{\text{T}} \right] \, .</script>
<p>However, usually our likelihood function is complicated and computing the expectation is intractable. We can approximate the expectation in \( \text{F} \) using the empirical distribution \( \hat{q}(x) \), which is given by our training data \( X = \{ x_1, x_2, \cdots, x_N \} \). In this form, \( \text{F} \) is called the Empirical Fisher:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{F} &= \frac{1}{N} \sum_{i=1}^{N} \nabla \log p(x_i \vert \theta) \, \nabla \log p(x_i \vert \theta)^{\text{T}} \, .
\end{align} %]]></script>
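<p>As a sketch of this formula, assuming a hypothetical <code>score_fn</code> that returns the per-example score vector \( \nabla \log p(x_i \vert \theta) \), the Empirical Fisher is just an average of outer products:</p>

```python
import numpy as np

def empirical_fisher(score_fn, theta, xs):
    """Empirical Fisher: (1/N) * sum_i s(theta; x_i) s(theta; x_i)^T."""
    n = theta.shape[0]
    F = np.zeros((n, n))
    for x in xs:
        s = score_fn(theta, x)
        F += np.outer(s, s)
    return F / len(xs)

# Example: for N(x | theta, I) the score is (x - theta) and the true Fisher is I.
rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0])
xs = rng.normal(theta, 1.0, size=(50_000, 2))
F = empirical_fisher(lambda t, x: x - t, theta, xs)
```

<p>With enough samples, <code>F</code> here approaches the identity matrix, matching the analytic Fisher of a unit-covariance Gaussian.</p>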
<h2 class="section-heading">Fisher and Hessian</h2>
<p>One property of \( \text{F} \) that is not obvious is that it has the interpretation of being the negative expected Hessian of our model’s log likelihood.</p>
<p><strong>Claim:</strong>
The negative expected Hessian of log likelihood is equal to the Fisher Information Matrix \( \text{F} \).</p>
<p><em>Proof.</em> The Hessian of the log likelihood is given by the Jacobian of its gradient:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{H}_{\log p(x \vert \theta)} &= \text{J} \left( \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)} \right) \\[5pt]
&= \frac{ \text{H}_{p(x \vert \theta)} \, p(x \vert \theta) - \nabla p(x \vert \theta) \, \nabla p(x \vert \theta)^{\text{T}}}{p(x \vert \theta) \, p(x \vert \theta)} \\[5pt]
&= \frac{\text{H}_{p(x \vert \theta)} \, p(x \vert \theta)}{p(x \vert \theta) \, p(x \vert \theta)} - \frac{\nabla p(x \vert \theta) \, \nabla p(x \vert \theta)^{\text{T}}}{p(x \vert \theta) \, p(x \vert \theta)} \\[5pt]
&= \frac{\text{H}_{p(x \vert \theta)}}{p(x \vert \theta)} - \left( \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)} \right) \left( \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)}\right)^{\text{T}} \, ,
\end{align} %]]></script>
<p>where the second line is a result of applying the quotient rule of derivatives. Taking the expectation wrt. our model, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ \text{H}_{\log p(x \vert \theta)} \right] &= \mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ \frac{\text{H}_{p(x \vert \theta)}}{p(x \vert \theta)} - \left( \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)} \right) \left( \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)} \right)^{\text{T}} \right] \\[5pt]
&= \mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ \frac{\text{H}_{p(x \vert \theta)}}{p(x \vert \theta)} \right] - \mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ \left( \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)} \right) \left( \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)}\right)^{\text{T}} \right] \\[5pt]
&= \int \frac{\text{H}_{p(x \vert \theta)}}{p(x \vert \theta)} p(x \vert \theta) \, \text{d}x \, - \mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ \nabla \log p(x \vert \theta) \, \nabla \log p(x \vert \theta)^{\text{T}} \right] \\[5pt]
&= \text{H}_{\int p(x \vert \theta) \, \text{d}x} \, - \text{F} \\[5pt]
&= \text{H}_{1} - \text{F} \\[5pt]
&= -\text{F} \, .
\end{align} %]]></script>
<p>Thus we have \( \text{F} = -\mathop{\mathbb{E}}_{p(x \vert \theta)} \left[ \text{H}_{\log p(x \vert \theta)} \right] \).</p>
<p class="right">\( \square \)</p>
<p>Knowing this result, we can see the role of \( \text{F} \) as a measure of curvature of the log likelihood function.</p>
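<p>We can sanity-check this identity on a toy model: for \( \mathcal{N}(x \vert \theta, \sigma^2) \) with known \( \sigma \), the score wrt. \( \theta \) is \( (x - \theta)/\sigma^2 \) and the Hessian wrt. \( \theta \) is the constant \( -1/\sigma^2 \), so both sides should equal \( 1/\sigma^2 \):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 2.0, 0.5
x = rng.normal(theta, sigma, size=200_000)

# Fisher as the covariance of the score s = (x - theta) / sigma^2 ...
F = np.mean(((x - theta) / sigma**2) ** 2)

# ... and as the negative expected Hessian; here H = -1/sigma^2 for every x.
neg_EH = 1.0 / sigma**2

print(F, neg_EH)  # both approximately 4.0
```

<p>The Monte Carlo estimate of the score covariance agrees with \( -\mathop{\mathbb{E}}[\text{H}] \), as the claim requires.</p>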
<h2 class="section-heading">Conclusion</h2>
<p>The Fisher Information Matrix is defined as the covariance of the score function. It is a curvature matrix and has the interpretation of being the negative expected Hessian of the log likelihood function. Thus the immediate application of \( \text{F} \) is as a drop-in replacement for \( \text{H} \) in second order optimization methods.</p>
<p>One of the most exciting results about \( \text{F} \) is that it has a connection to KL-divergence. This gives rise to the natural gradient method, which we shall discuss further in the next article.</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Martens, James. “New insights and perspectives on the natural gradient method.” arXiv preprint arXiv:1412.1193 (2014).</li>
<li>Ly, Alexander, et al. “A tutorial on Fisher information.” Journal of Mathematical Psychology 80 (2017): 40-55.</li>
</ol>
Sun, 11 Mar 2018 07:00:00 +0100
http://wiseodd.github.io/techblog/2018/03/11/fisher-information/