2.3: the Chain Rule

$\newcommand{\R}{\mathbb R }$ $\newcommand{\N}{\mathbb N }$ $\newcommand{\Z}{\mathbb Z }$ $\newcommand{\bfa}{\mathbf a}$ $\newcommand{\bfb}{\mathbf b}$ $\newcommand{\bff}{\mathbf f}$ $\newcommand{\bfg}{\mathbf g}$ $\newcommand{\bfG}{\mathbf G}$ $\newcommand{\bfh}{\mathbf h}$ $\newcommand{\bfu}{\mathbf u}$ $\newcommand{\bfx}{\mathbf x}$ $\newcommand{\bfy}{\mathbf y}$ $\newcommand{\ep}{\varepsilon}$


  1. the Chain Rule
  2. Some important special cases
  3. Some examples
  4. Proof of the Chain Rule (Optional)
  5. Be Careful!
  6. Problems

Statement of the Chain Rule

Suppose that $S$ and $T$ are open subsets of $\R^n$ and $\R^m$, and that we are given functions $\bfg: S\to \R^m$ and $\bff:T\to \R^\ell$. Assume also that $\bfa\in S$ is a point such that $\bfg(\bfa)\in T$, so that $\bff (\bfg(\bfa))$ is well-defined.

Theorem 1: the Chain Rule. Assume that $S$ and $T$ are open subsets of $\R^n$ and $\R^m$, and that we are given functions $\bfg: S\to \R^m$ and $\bff:T\to \R^\ell$. Assume also that $\bfa\in S$ is a point such that $\bfg(\bfa)\in T$; thus $\bff\circ \bfg(\bfx) = \bff (\bfg(\bfx))$ is well-defined for all $\bfx$ close to $\bfa$.
If $\bfg$ is differentiable at $\bfa$ and $\bff$ is differentiable at $\bfg(\bfa)$, then the composite function $\bff\circ \bfg$ is differentiable at $\bfa$, and \begin{equation}\label{cr1} \boxed{ D(\bff\circ\bfg)(\bfa) = D\bff(\bfg(\bfa)) \ D\bfg(\bfa).} \end{equation}
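If you like, the boxed formula can be checked numerically. Here is a short Python sketch (the maps $\bfg$, $\bff$ and the point $\bfa$ are made up for illustration, with $n=m=\ell=2$); it compares a central-difference Jacobian of $\bff\circ\bfg$ at $\bfa$ with the matrix product $D\bff(\bfg(\bfa))\, D\bfg(\bfa)$:

```python
import math

# Illustrative maps (not from the text): g: R^2 -> R^2, f: R^2 -> R^2.
def g(x):
    x1, x2 = x
    return (x1 * x2, x1 + x2 ** 2)

def f(y):
    y1, y2 = y
    return (math.sin(y1), y1 * y2)

def Dg(x):            # analytic Jacobian of g
    x1, x2 = x
    return [[x2, x1],
            [1.0, 2 * x2]]

def Df(y):            # analytic Jacobian of f
    y1, y2 = y
    return [[math.cos(y1), 0.0],
            [y2, y1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def jacobian_fd(F, x, h=1e-6):
    # central-difference approximation to the Jacobian of F at x
    n, m = len(x), len(F(x))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        Fp, Fm = F(xp), F(xm)
        for i in range(m):
            J[i][j] = (Fp[i] - Fm[i]) / (2 * h)
    return J

a = (0.7, -0.3)
lhs = jacobian_fd(lambda x: f(g(x)), a)   # D(f∘g)(a), numerically
rhs = matmul(Df(g(a)), Dg(a))             # Df(g(a)) Dg(a)
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-5
           for i in range(2) for j in range(2))
```

The two matrices agree to within finite-difference error, as the chain rule predicts.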

It is sometimes helpful to write out \eqref{cr1} in terms of components and partial derivatives. In doing this, we will write $\bff$ as a function of variables $\bfy = (y_1,\ldots, y_m)\in \R^m$, and $\bfg$ as a function of $\bfx = (x_1,\ldots, x_n)\in \R^n$. Then \eqref{cr1} is the same as \begin{align}\label{crcoord} \frac {\partial }{\partial x_j} (f_k\circ \bfg)(\bfa) &= \sum_{i=1}^m \frac{\partial f_k}{\partial y_i}(\bfg(\bfa)) \ \frac{\partial g_i}{\partial x_j}(\bfa)\ \ \qquad\mbox{ for }k=1,\dots, \ell \mbox{ and }j=1,\ldots, n . \ \end{align} This can be checked simply by writing out both sides of \eqref{cr1} --- the left-hand side is the $(k,j)$ component of the matrix $D(\bff\circ \bfg)(\bfa)$, and the right-hand side is the $(k,j)$ component of the matrix product $D\bff(\bfg(\bfa)) \ D\bfg(\bfa)$.

The chain rule is also sometimes written in the following way. As above, let's write $\bfx = (x_1,\ldots, x_n)$ and $\bfy = (y_1,\ldots, y_m)$ to denote typical points in $\R^n$ and $\R^m$. We can also write ${\bf u} = (u_1,\ldots, u_\ell)$ to denote a typical point in $\R^\ell$. If we suppose that the $\bfx, \bfy$ and $\bf u$ variables are related by $$ \bfy = \bfg(\bfx), \qquad {\bf u} = \bff(\bfy) = \bff(\bfg(\bfx)), $$ then it is traditional to write, for example, $\frac{\partial u_k}{\partial x_j} $ to denote the infinitesimal change in the $k$th component of $\bf u$ in response to an infinitesimal change in $x_j$, that is, $\frac{\partial u_k}{\partial x_j} = \frac{\partial } {\partial x_j}( f_k\circ \bfg)$. Using this notation, and with similar interpretations for $\frac{\partial u_k}{\partial y_i}$ and $\frac{\partial y_i}{\partial x_j}$, we can write the chain rule in the form \begin{equation}\label{cr.trad} \boxed{ \ \ \frac{\partial u_k}{\partial x_j} = \frac{\partial u_k}{\partial y_1}\frac{\partial y_1}{\partial x_j} + \cdots +\frac{\partial u_k}{\partial y_m}\frac{\partial y_m}{\partial x_j}\ \ } \end{equation} for $k=1,\dots, \ell$ and $j=1,\ldots, n$. We emphasize that this is just a rewriting of the chain rule in suggestive notation, and its actual meaning is identical to that of \eqref{crcoord}.

Some important special cases

The case $\ell=1$.

We most often apply the chain rule to compositions $f\circ \bfg$, where $f$ is a scalar function. (This is the case $\ell=1$, in the above notation.) In this case formula \eqref{cr1} becomes simply \begin{equation}\label{cr.scalar} D(f\circ\bfg)(\bfa) = Df(\bfg(\bfa)) \ D\bfg(\bfa), \end{equation} where $Df$ is a $1\times m$ matrix, that is, a row vector, and $D(f\circ \bfg)$ is a $1\times n$ matrix, also a row vector (though of a different length). Also, in this situation the alternate notation \eqref{cr.trad} reduces to $$ \frac{\partial u}{\partial x_j} = \frac{\partial u}{\partial y_1}\frac{\partial y_1}{\partial x_j} + \cdots +\frac{\partial u}{\partial y_m}\frac{\partial y_m}{\partial x_j}\ \quad \mbox{ for }j=1,\ldots, n. $$

The general form \eqref{cr1} of the chain rule just says that for a vector function $\bff$, every component $f_k$ satisfies \eqref{cr.scalar}, for $k=1,\ldots, \ell$.

The case $\ell=n=1$.

Specializing still more, a case that arises rather often is when $\bfg:\R \to \R^m$ and $f:\R^m\to \R$. Then $f\circ \bfg$ is a function $\R\to \R$, and the chain rule states that \begin{align}\label{crsc1} \frac{d}{dt} (f\circ \bfg)(t) &= \sum_{j=1}^m \frac{\partial f}{\partial x_j}(\bfg(t)) \frac{d g_j}{dt}(t) \ = \ \nabla f(\bfg(t)) \cdot \bfg'(t). \end{align} If we write this in the condensed notation used in \eqref{cr.trad} above, with variables $u$ and $\bfx = (x_1,\ldots, x_m)$ and $t$ related by $u = f(\bfx)$ and $\bfx = \bfg(t)$, then we get $$ \frac{d u}{dt} = \frac {\partial u}{\partial x_1}\frac{d x_1}{dt} +\cdots + \frac {\partial u}{\partial x_m}\frac{d x_m}{dt}. $$ This means the same as \eqref{crsc1}, but you may find that it is easier to remember when written this way.
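Formula \eqref{crsc1} is also easy to test numerically. The following Python sketch (with an arbitrary made-up $f$ and $\bfg$, not from the text) compares a finite-difference value of $\frac{d}{dt}(f\circ\bfg)(t)$ with $\nabla f(\bfg(t))\cdot \bfg'(t)$:

```python
import math

def f(x):          # f: R^2 -> R, an arbitrary test function
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):     # its gradient, computed by hand
    return (2.0 * x[0], 3.0)

def g(t):          # g: R -> R^2, an arbitrary test curve
    return (math.cos(t), t * t)

def g_prime(t):
    return (-math.sin(t), 2.0 * t)

t, h = 1.3, 1e-6
# left side: d/dt (f∘g)(t), by central differences
lhs = (f(g(t + h)) - f(g(t - h))) / (2 * h)
# right side: ∇f(g(t)) · g'(t)
rhs = sum(p * q for p, q in zip(grad_f(g(t)), g_prime(t)))
assert abs(lhs - rhs) < 1e-5
```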

See Example 2 below for an illustration of this special case.

Some examples

Example 1: polar coordinates

Let $$ S := \{ (r,\theta)\in \R^2 : r>0 \} $$ and let's define $\bfg:S\to \R^2$ by $$ \bfg(r,\theta) =(r\cos \theta, r\sin \theta). $$ (Remember, $(r,\theta)$ and $(r\cos \theta, r\sin \theta)$ are understood to mean the column vectors $\binom r \theta$ and $\binom {r\cos\theta} {r\sin\theta}$, even when we do not write them that way; see Section 0.2.4.)

The geometric meaning of this is: if $(x,y) = \bfg(r,\theta)$, then $r$ is the distance between $(x,y)$ and the origin, and $\theta$ is the angle between the positive $x$-axis and the line from the origin to $(x,y)$.


Now suppose that we are given a function $f:\R^2\to \R$. Let's write $\phi$ to denote the composite function $\phi = f\circ \bfg$ (in the traditional notation of \eqref{cr.trad}, $u = \phi(r,\theta)$), so $$ \phi(r,\theta) = f(r\cos\theta, r\sin\theta). $$ Then $$ D\bfg = \left( \begin{array}{rr} \cos \theta & -r\sin\theta\\ \sin\theta & r\cos\theta\end{array} \right), $$ so the Chain Rule says that $$ D\phi = \big( \partial_r \phi \ \ \ \partial_\theta \phi\big) = \big(\partial_x f \ \ \partial_y f\big) \left( \begin{array}{rr} \cos \theta & -r\sin\theta\\ \sin\theta & r\cos\theta\end{array} \right), $$ where we have to remember that $\partial_x f$ and $\partial_y f$ are evaluated at $\bfg(r,\theta)$. We can write this out in more detail as \begin{align} \partial_r \phi = \partial_r (f\circ \bfg) &= \partial_x f(r\cos\theta,r\sin\theta) \cos \theta + \partial_y f(r\cos\theta,r\sin\theta) \sin \theta , \nonumber \\ \partial_\theta \phi = \partial_\theta (f\circ \bfg) &= -\partial_x f(r\cos\theta,r\sin\theta)\, r\sin \theta + \partial_y f(r\cos\theta,r\sin\theta)\, r\cos \theta . \nonumber \end{align}

If we use the notation \eqref{cr.trad}, then the chain rule takes the form \begin{align} \frac {\partial u} {\partial r} &= \frac{ \partial u}{\partial x} \frac{ \partial x}{\partial r} + \frac{ \partial u}{\partial y} \frac{ \partial y}{\partial r} \nonumber \\
&= \frac{ \partial u}{\partial x}\ \cos \theta + \frac{ \partial u}{\partial y}\ \sin \theta\nonumber \\ \frac {\partial u} {\partial \theta} &= \frac{ \partial u}{\partial x} \frac{ \partial x}{\partial \theta} + \frac{ \partial u}{\partial y} \frac{ \partial y}{\partial \theta} \nonumber \\
&=- \frac{ \partial u}{\partial x} \ r \sin \theta + \frac{ \partial u}{\partial y} \ r\cos\theta \nonumber \end{align}

The formulas for $\partial_r\phi$ and $\frac{\partial u}{\partial r}$ mean exactly the same thing; the differences are mainly choices of notation. One significant difference is that we have written, for example, $\partial_x f(r\cos\theta, r\sin \theta)$ in the first and $\frac{\partial u}{\partial x}$ in the second, and similarly for the $y$ derivatives. In particular, in the first formula, we have explicitly indicated that derivatives of $f$ should be evaluated at $\bfg(r,\theta)$, whereas in the second, we have left it to the reader to figure this out.
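The polar-coordinate formulas above can be verified numerically for a particular $f$. In this Python sketch, $f(x,y) = x^2 y + e^x$ is an arbitrary test function (not from the text); we compare finite-difference values of $\partial_r\phi$ and $\partial_\theta\phi$ with the chain-rule expressions:

```python
import math

def f(x, y):       # arbitrary test function
    return x * x * y + math.exp(x)

def fx(x, y):      # ∂f/∂x, by hand
    return 2 * x * y + math.exp(x)

def fy(x, y):      # ∂f/∂y, by hand
    return x * x

def phi(r, th):    # the composite f(r cos θ, r sin θ)
    return f(r * math.cos(th), r * math.sin(th))

r, th, h = 2.0, 0.8, 1e-6
x, y = r * math.cos(th), r * math.sin(th)

dphi_dr = (phi(r + h, th) - phi(r - h, th)) / (2 * h)
dphi_dth = (phi(r, th + h) - phi(r, th - h)) / (2 * h)

# ∂φ/∂r = ∂f/∂x cos θ + ∂f/∂y sin θ
assert abs(dphi_dr - (fx(x, y) * math.cos(th) + fy(x, y) * math.sin(th))) < 1e-5
# ∂φ/∂θ = -∂f/∂x r sin θ + ∂f/∂y r cos θ
assert abs(dphi_dth - (-fx(x, y) * r * math.sin(th) + fy(x, y) * r * math.cos(th))) < 1e-5
```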

Example 2. Let $\bfg:\R\to \R^n$ be a differentiable function, and consider $$ \phi(t) := |\bfg(t)| = f(\bfg(t))\qquad\mbox{ for }\quad f(\bfx) = |\bfx|. $$ If we think of $\bfg(t)$ as being the position at time $t$ of a particle that is moving around in $\R^n$, then $\phi(t)$ is the distance at time $t$ of the particle from the origin.

Exercise: If you have not already done it, check that $f$ is differentiable everywhere except at the origin, and that $$ \nabla f(\bfx) = \frac{\bfx}{|\bfx|}\qquad\mbox{ for }\bfx\ne {\bf 0}. $$

A digression: the geometric meaning of the formula for $\nabla f$. Recall what we know about the meaning of $\nabla f(\bfx)$: it is a vector that points in the direction in which $f$ increases most rapidly near $\bfx$, and its magnitude is the rate of increase in that direction. At a point $\bfx$, if I want to increase the distance from the origin as quickly as possible, I should move directly away from the origin, that is, in the direction parallel to $\bfx$ (the vector from the origin to the point where I am located). So $\nabla f(\bfx)$ should have the form $\lambda \bfx$ for some $\lambda\ge 0$. And if I move in this direction, I will see my distance from the origin increase at exactly a rate of 1 unit of distance per unit of distance travelled. So the rate of increase is $1$, and thus $|\nabla f(\bfx)|$ should equal $1$. Putting these together, we conclude that $\nabla f(\bfx)$ should equal $\frac \bfx{| \bfx|}$. Of course, this is not a proof, but it may be helpful for our understanding.

Then using the chain rule \eqref{crsc1} and properties of the dot product, we find that $$ \frac d{dt} |\bfg(t)| = \frac{\bfg(t)}{|\bfg(t)|}\cdot \bfg'(t) = |\bfg'(t)| \cos \theta $$ where $\theta$ is the angle between $\frac{\bfg(t)}{|\bfg(t)|}$ and $\bfg'(t)$. This says that the rate at which distance from ${\bf 0 }$ is changing equals the speed $|\bfg'(t)|$ at which the particle is moving multiplied by the cosine of the angle between the direction of motion and the direction pointing exactly away from the origin.
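Here is a quick numerical check of the formula $\frac{d}{dt}|\bfg(t)| = \frac{\bfg(t)}{|\bfg(t)|}\cdot \bfg'(t)$, using a helix in $\R^3$ as a made-up test curve (not from the text):

```python
import math

def g(t):          # a helix, used only as a test curve
    return (math.cos(t), math.sin(t), t)

def g_prime(t):
    return (-math.sin(t), math.cos(t), 1.0)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

t, h = 0.9, 1e-6
# left side: d/dt |g(t)| by central differences
lhs = (norm(g(t + h)) - norm(g(t - h))) / (2 * h)
# right side: (g(t)/|g(t)|) · g'(t)
gt = g(t)
rhs = sum(a * b for a, b in zip(gt, g_prime(t))) / norm(gt)
assert abs(lhs - rhs) < 1e-5
```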

Example 3: homogeneous functions

A function $f:\R^n\to \R$ is said to be homogeneous of degree $\alpha$ if $$ f(\lambda \bfx) = \lambda^\alpha f(\bfx)\quad\mbox{ for all }\bfx\ne{\bf 0}\mbox{ and }\lambda>0. $$ These were introduced in one of the problems in Section 2.1. For example, a monomial $x^ay^bz^c$ is homogeneous of degree $\alpha = a+b+c$. The definition of homogeneous also applies if the domain of $f$ does not include the origin, and for the present discussion, it does not matter whether or not $f({\bf 0})$ is defined.

An interesting fact about homogeneous functions can be proved using the chain rule.

Let's fix a nonzero vector $\bfx\in \R^n$ and define $\bfg:(0,\infty)\to \R^n$ and $h:(0,\infty)\to \R$ by $$ \bfg(\lambda) = \lambda \bfx, \qquad\qquad h(\lambda) = f(\bfg(\lambda)) = f(\lambda \bfx). $$ We know that $h(\lambda) = \lambda^\alpha f(\bfx)$, so clearly $$ h'(\lambda) = \alpha \lambda^{\alpha-1} f(\bfx). $$ On the other hand, we can compute $h' = (f\circ \bfg)'$ using the chain rule. This leads to $$ h'(\lambda) = \nabla f(\lambda \bfx) \cdot \bfx $$ since clearly $\bfg'(\lambda)=\bfx$. Setting $\lambda = 1$ and equating the two expressions for $h'(1)$, we find that $$ \boxed{ \quad \nabla f(\bfx) \cdot \bfx = \alpha f(\bfx).\quad} $$ Since $\bfx$ was an arbitrary nonzero element of $\R^n$, we conclude that this holds at all such points, whenever $f$ is a differentiable homogeneous function of degree $\alpha$.
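The boxed identity $\nabla f(\bfx)\cdot \bfx = \alpha f(\bfx)$ (often called Euler's relation) is easy to confirm for a concrete monomial. In this sketch we take $f(x,y,z) = x^2yz^3$, which is homogeneous of degree $\alpha = 6$; the point is made up for illustration:

```python
# f(x,y,z) = x^2 y z^3 is homogeneous of degree alpha = 2 + 1 + 3 = 6
f = lambda x, y, z: x ** 2 * y * z ** 3
alpha = 6
grad = lambda x, y, z: (2 * x * y * z ** 3, x ** 2 * z ** 3, 3 * x ** 2 * y * z ** 2)

p = (1.1, -0.4, 0.7)   # an arbitrary nonzero point

# Euler's relation: ∇f(x)·x = α f(x)
lhs = sum(g * c for g, c in zip(grad(*p), p))
assert abs(lhs - alpha * f(*p)) < 1e-9

# homogeneity itself: f(λx) = λ^α f(x)
lam = 2.5
assert abs(f(*(lam * c for c in p)) - lam ** alpha * f(*p)) < 1e-9
```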

Example 4: level sets and the gradient

Next we use the chain rule to prove a basic property of the gradient. Assume that $S$ is an open subset of $\R^n$ and that $f:S\to \R$ is differentiable at $\bfa$. Then \begin{equation}\label{lsg} \boxed{ \ \ ``\nabla f(\bfa) \mbox{ is orthogonal to the level set of $f$ that passes through $\bfa$.}''\ \ } \end{equation}

We will first explain more precisely what this is supposed to mean, then justify it. To start, let's introduce some notation. Let \begin{equation}\label{lsnot} c := f(\bfa), \qquad \mbox{ and } \quad C := \{ \bfx \in S : f(\bfx) = c\}. \end{equation} Thus $C$ is the level set of $f$ that passes through $\bfa$. We will understand \eqref{lsg} to mean that \begin{equation} \nabla f(\bfa)\cdot {\bf v} = 0\qquad\mbox{ for every vector $\bf v$ that is tangent to $C$ at $\bfa$.} \label{lsg2} \end{equation}

Note, if $\nabla f(\bfa)= {\bf 0}$, then $\nabla f(\bfa)\cdot {\bf v}=0$ for every $\bf v$, and \eqref{lsg2} is trivially true. So we will assume that $$\nabla f(\bfa)\ne {\bf 0}. $$ This assumption has some interesting and relevant geometric consequences that we will discuss in detail in a few weeks.

To prove that \eqref{lsg2} is true, we need to say what we mean by every vector $\bf v$ that is tangent to $C$ at $\bfa$.

To do this, consider any open interval $I\subset \R$ such that $0\in I$, and let $\gamma:I \to \R^n$ be a curve such that \begin{equation} \gamma(t)\in C\quad \mbox{ for all }t\in I, \qquad \gamma(0) = \bfa,\qquad \mbox{ and } \gamma'(0) \mbox{ exists}. \label{gamma1}\end{equation} We now define: $$ {\bf v}\mbox{ is tangent to }C \mbox{ at }\bfa \quad \mbox{ if }\quad {\bf v} = \gamma'(0) \mbox{ for some curve }\gamma \mbox{ satisfying \eqref{gamma1}.} $$

So to prove \eqref{lsg2} (and hence \eqref{lsg}) we must show that \begin{equation} \mbox{ if $\gamma$ is any differentiable curve satisfying \eqref{gamma1},} \ \ \ \mbox{ then } \ \ \ \gamma'(0)\cdot \nabla f(\bfa) = 0. \label{lsg3}\end{equation}

Proof that \eqref{lsg3} holds Fix an open interval $I\subset \R$ containing $0$ and a curve $\gamma$ satisfying \eqref{gamma1}. Let's write $h(t) := f\circ \gamma(t)$. In view of the definition of the level set $C$, the assumption that $\gamma(t)\in C$ for all $t\in I$ means that $h(t) = f(\gamma(t))=c$ for all $t\in I$.

Thus $$ h'(t)= 0\mbox{ for all }t, \qquad\mbox{ and in particular, }\quad h'(0)=0. $$ On the other hand, we can compute $h'(0)$ using the chain rule, to find that $$ 0 = h'(0) = (f\circ \gamma)'(0) = \nabla f(\gamma(0)) \cdot \gamma'(0)
\overset{\eqref{gamma1}}= \nabla f(\bfa)\cdot \gamma'(0). $$ This proves \eqref{lsg3}. $\quad \Box$

In a way this discussion is incomplete. It would be better if we could say that \begin{equation}\label{tv2} {\bf v}\mbox{ is tangent to $ C$} \qquad \iff \qquad \nabla f(\bfa)\cdot {\bf v} = 0. \end{equation} So far we have only proved that the implication $ \Longrightarrow$ holds. Although we have not proved it, in fact \eqref{tv2} is true. We will return to this point later.
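The orthogonality statement can be illustrated numerically. In this Python sketch (the function, level set, and curve are made up for illustration), we take $f(\bfx)=|\bfx|^2$, whose level set $f=1$ is the unit sphere, and a curve $\gamma$ lying in that sphere with $\gamma(0)=\bfa$:

```python
import math

f = lambda x, y, z: x * x + y * y + z * z        # level set f = 1 is the unit sphere
grad_f = lambda x, y, z: (2 * x, 2 * y, 2 * z)

# a curve lying entirely in the level set, with gamma(0) = a = (1, 0, 0)
gamma = lambda t: (math.cos(t), math.sin(t), 0.0)
a = gamma(0.0)
assert abs(f(*a) - 1.0) < 1e-12                  # gamma(0) is on the level set

h = 1e-6
# gamma'(0), by central differences
gp0 = tuple((p - m) / (2 * h) for p, m in zip(gamma(h), gamma(-h)))
# ∇f(a) · gamma'(0) should vanish
dot = sum(g * v for g, v in zip(grad_f(*a), gp0))
assert abs(dot) < 1e-5
```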

Tangent plane to a level set

The above discussion motivates some definitions.

Suppose $S$ is an open subset of $\R^3$ and that $f:S\to \R$ is a function that is differentiable at a point $\bfa\in S$. Assume also that $$ \nabla f(\bfa)\ne {\bf 0}. $$ Let's also continue to use the notation \eqref{lsnot}, so that for example $C$ denotes the level set of $f$ that passes through $\bfa$.

We define \begin{equation}\label{tp.def} \mbox{ the tangent plane to }C \mbox{ at }\bfa := \{ \bfx \in \R^3 : (\bfx - \bfa)\cdot \nabla f(\bfa) = 0 \}. \end{equation} If we are willing to accept \eqref{tv2}, then the definition states that a point $\bfx$ belongs to the tangent plane to $C$ at $\bfa$ if and only if $\bfx$ has the form $\bfx = \bfa+{\bf v}$, where $\bf v$ is tangent to the level set at $\bfa$.

Example 5. Find the tangent plane to the surface $$ C := \{ (x,y,z)\in \R^3 : x^2 - 2xy +4yz - z^2 = 2\} $$ at the point $\bfa :=(1,1,1)$.

We just apply the definition to find that \begin{align} \mbox{ tangent plane to }C\mbox{ at }\bfa \mbox{ is } & \{ (x,y,z)\in \R^3 : (x-1, y-1, z-1)\cdot (0,2, 2) = 0\} \nonumber \\ = \ & \{ (x,y,z)\in \R^3 : y+z = 2 \}. \nonumber \end{align}

Note, if we had said ``find the equation for the tangent plane'' instead of ``find the tangent plane'', then the answer would have been simply $y+z=2$.
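The computation in Example 5 can be double-checked in a few lines of Python, using the function and point from the example:

```python
# f and a are exactly as in Example 5
f = lambda x, y, z: x * x - 2 * x * y + 4 * y * z - z * z
grad_f = lambda x, y, z: (2 * x - 2 * y, -2 * x + 4 * z, 4 * y - 2 * z)

a = (1.0, 1.0, 1.0)
assert f(*a) == 2.0                      # a lies on the level set f = 2
assert grad_f(*a) == (0.0, 2.0, 2.0)     # the gradient used in the example

# A point x is on the tangent plane iff (x - a)·∇f(a) = 0, i.e. y + z = 2.
# Two sample points satisfying y + z = 2 (the x-coordinate is unconstrained):
for p in [(5.0, 0.5, 1.5), (-3.0, 2.0, 0.0)]:
    lhs = sum((pc - ac) * gc for pc, ac, gc in zip(p, a, grad_f(*a)))
    assert lhs == 0.0 and p[1] + p[2] == 2.0
```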

Proof of the Chain Rule (optional)

We now sketch the proof of the chain rule.

Recall the assumptions: $\bfg$ is differentiable at $\bfa$, and $\bff$ is differentiable at $\bfb := \bfg(\bfa)$.

The definition of differentiability involves error terms, which we typically write as $E({\bf h})$. In this proof we have to keep track of several different error terms, so we will use subscripts to distinguish between them. For example, we will write $E_{\bfg, \bfa}( {\bf h})$ to denote the error term for $\bfg$ near the point $\bfa$.

To keep the notation manageable, we will write $M$ instead of $D\bfg(\bfa)$ and $N$ instead of $D\bff(\bfg(\bfa)) = D\bff(\bfb)$. Thus, $M$ is the (unique) $m\times n$ matrix such that \begin{equation}\label{dga} \bfg(\bfa +{\bf h}) = \bfg(\bfa ) + M \bfh + E_{\bfg, \bfa}({\bf h})\qquad\mbox{ where } \lim_{\bfh \to 0}\frac 1{|\bf h|} E_{\bfg, \bfa}({\bf h}) = 0, \end{equation} and similarly, $N$ is characterized by the fact that \begin{equation}\label{dfb} \bff(\bfb +{\bf k}) = \bff(\bfb ) + N {\bf k} + E_{\bff, \bfb}({\bf k})\qquad\mbox{ where } \lim_{\bf k \to 0}\frac 1{|\bf k|} E_{\bff, \bfb}({\bf k}) = 0. \end{equation} Using \eqref{dga}, we find that \begin{align} \bff(\bfg(\bfa +{\bf h})) &= \bff\Big( \overbrace{\bfg(\bfa) }^{\bfb}+ \overbrace{M \, {\bf h} + E_{\bfg, \bfa}({\bf h}) }^{\bf k}\Big) \\ &\overset{\eqref{dfb}}= \bff(\overbrace {\bfg(\bfa)}^\bfb) + N( \overbrace{M \, {\bf h} + E_{\bfg, \bfa}({\bf h}) }^{\bf k}) + E_{\bff, \bfb}({\bf k})\\ &= \bff(\bfg(\bfa)) + NM{\bf h} \ + \ N E_{\bfg, \bfa}({\bf h}) +E_{\bff, \bfb}({\bf k}). \end{align} We can rewrite this as \begin{multline} \bff\circ \bfg(\bfa+{\bfh}) = \bff\circ \bfg(\bfa) + N M\, {\bf h} + E_{\bff\circ \bfg, \bfa}(\bfh),\\ \qquad\mbox{ where } E_{\bff\circ \bfg, \bfa}(\bfh) : = NE_{\bfg, \bfa}({\bf h}) + E_{\bff, \bfb}({\bf k}), \qquad {\bf k} = M{\bf h}+ E_{\bfg, \bfa}({\bf h}). \nonumber \end{multline} Since $N M = D\bff(\bfg(\bfa))\, D\bfg(\bfa)$, this will imply the chain rule, if we can verify that \begin{equation}\label{cr.proof} \lim_{\bf h\to 0} \frac 1{|\bf h|} E_{\bff\circ \bfg, \bfa}(\bfh) = 0. \end{equation}

The proof of \eqref{cr.proof} is even more optional than the rest of the proof, but if you are interested, it is given below.

We consider separate pieces of the error term one after another. We will be terse.

First, since $N$ is a fixed matrix, there exists a number $C$ such that $|N {\bf v}| \le C |{\bf v}|$ for all ${\bf v}\in \R^m$. Since $E_{\bfg, \bfa}({\bf h})$ is a vector in $\R^m$, it follows that \begin{equation}\label{cr.p1} \frac 1{|\bf h|} |NE_{\bfg, \bfa}({\bf h})| \le \frac C{|\bf h|} |E_{\bfg, \bfa}({\bf h})| \to 0 \mbox{ as }{\bf h}\to 0. \end{equation} Also, $$ \frac 1{|\bf h|} |E_{\bff, \bfb}({\bf k})| = \frac {|\bf k|} {|\bf h|} \frac 1{|\bf k|} |E_{\bff, \bfb}({\bf k})| . $$ Since $M$ is a fixed matrix and $\lim_{\bf h\to 0} \frac{E_{\bfg, \bfa}(\bfh)}{|\bf h|} = 0$, one can check that there exists some constant $C$ such that $$ |{\bf k}| \le C |{\bf h}| \qquad \mbox{ whenever } 0 <|{\bf h}| < 1. $$ It follows that ${\bf k}\to 0$ as $\bf h \to 0$, and hence that \begin{equation}\label{cr.p2} \lim_{\bfh \to {\bf 0}}\frac 1{|\bf h|} |E_{\bff, \bfb}({\bf k})| \le \lim_{\bf k \to {\bf 0}}C \frac 1{|\bf k|} |E_{\bff, \bfb}({\bf k})| = 0. \end{equation} Finally, we deduce \eqref{cr.proof} by adding up \eqref{cr.p1} and \eqref{cr.p2}.

Be Careful!

With the chain rule, it is not hard to get tripped up by obscure or ambiguous notation. We will illustrate this with an example.

Suppose we are given $f:\R^3\to \R$, which we will write as a function of variables $(x,y,z)$. Further assume that $\bfG:\R^2\to \R^3$ is a function of variables $(u,v)$, of the form $$ \bfG(u,v) = (u, v, g(u,v)) \qquad\mbox{ for some }g:\R^2\to \R. $$ Let's write $\phi := f\circ \bfG$. Then a routine application of the chain rule tells us that $$ ( \partial_u \phi \ \ \ \partial_v \phi ) \ = \ \left(\begin{array}{ccc} \frac{\partial f}{\partial x}\circ \bfG & \ \frac{\partial f}{\partial y}\circ \bfG &\ \frac{\partial f}{\partial z}\circ \bfG \end{array}\right) \left(\begin{array}{cc} 1&0\\ 0&1 \\ \frac{\partial g}{\partial u}& \frac{\partial g}{\partial v} \end{array}\right) $$ For simplicity, considering only the $u$ derivative, this says that $$ \frac{\partial \phi}{\partial u}(u,v) = \frac{\partial f}{\partial x} (u,v,g(u,v)) + \frac{\partial f}{\partial z}(u,v,g(u,v)) \frac {\partial g}{\partial u}(u,v). $$

This is perfectly correct but a little complicated. After all, since $x=u$ and $y=v$, it might be simpler to write $\bfG$ as a function of $x$ and $y$ rather than $u$ and $v$, i.e. $\bfG(x,y) = (x,y,g(x,y))$. Then we would write $$ \phi(x,y) = f(\bfG(x,y)) = f(x,y,g(x,y)), $$ and our formula for the derivative becomes (simply changing $(u,v)$ to $(x,y)$) \begin{equation} \frac{\partial \phi}{\partial x}(x,y) = \frac{\partial f}{\partial x} (x,y,g(x,y)) + \frac{\partial f}{\partial z}(x,y,g(x,y)) \frac {\partial g}{\partial x}(x,y). \label{wo05}\end{equation} However, this is a little ambiguous, since if someone sees the expression \begin{equation} \frac{\partial f}{\partial x} (x,y,g(x,y)) \label{wo1}\end{equation} they can be legitimately confused about whether it means the partial derivative of $f$ with respect to its first variable, evaluated at $(x,y,g(x,y))$ (first differentiate, then substitute), or the partial derivative with respect to $x$ of the composite function $(x,y)\mapsto f(x,y,g(x,y))$ (first substitute, then differentiate).

A solution to this conundrum that no one uses (so it can be ignored): the ambiguity could easily be resolved by adroit use of parentheses, one purpose of which is to indicate the order of operations. For example, it would be clear if we wrote \begin{align} \left(\frac{\partial f}{\partial x}\right) (x,y,g(x,y))\quad&\mbox{ to mean: first differentiate, then substitute}\nonumber \\ \frac{\partial }{\partial x} \big(f (x,y,g(x,y))\big)\quad&\mbox{ to mean: first substitute, then differentiate}\nonumber \end{align} But for some reason people almost never do this.

In this case the first interpretation is correct; the second interpretation is exactly what we called $\frac{\partial \phi}{\partial x}$. So expressions like \eqref{wo1} can be confusing, and \eqref{wo05} is only correct if the reader is able to figure out correctly what it means (which is arguably not so hard in this case).

One way to avoid this problem is to write the derivatives of $f$ as $\partial_1 f$ instead of $\frac{\partial f}{\partial x}$ or $\partial_x f$, and similarly $\partial_2 f$, etc. Then \eqref{wo05} looks like \begin{equation}\label{last} \partial_1\phi(x,y) = \partial_1 f(x,y,g(x,y)) + \partial_3 f(x,y,g(x,y)) \partial_1g(x,y), \end{equation} and this is correct and unambiguous, though still a little awkward. It can be written more simply as $$ \partial_1\phi= \partial_1 f + \partial_3 f\partial_1g, $$ as long as we trust our readers to figure out that derivatives of $g$ are evaluated at $(x,y)$ and derivatives of $f$ at $(x,y,g(x,y))$.

One can also get into more serious trouble, for example as follows. Let's continue to write $g$ as a function of $(x,y)$ rather than $(u,v)$, and let's also write $w$ as the name of the variable that is the output of the function $f$, that is, $w = f(x,y,z)$. Then we can write $$ w = f(x,y,z) \qquad \mbox{ where } \qquad z = g(x,y). $$ Suppose we want to know about rates of change in $w$ in response to infinitesimal or small changes in $x$, always restricting our attention to the set of points where $z=g(x,y)$. Since $z$ depends on $x$, we have to use the chain rule. If we use the notation \eqref{cr.trad}, we might write $$ \frac{\partial w}{\partial x} = \frac{\partial w}{\partial x}\frac{\partial x}{\partial x}+ \frac{\partial w}{\partial y}\frac{\partial y}{\partial x}+ \frac{\partial w}{\partial z}\frac{\partial z}{\partial x}. $$ But it is clear that $\frac{\partial x}{\partial x}= 1$, and (because $x$ and $y$ are independent variables) that $\frac{\partial y}{\partial x} = 0$. Thus the above equation reduces to \begin{equation} \frac{\partial w}{\partial x} = \frac{\partial w}{\partial x} + \frac{\partial w}{\partial z}\frac{\partial z}{\partial x}. \label{cr.example}\end{equation} It follows that \begin{equation}\label{wrong} \frac{\partial w}{\partial z}\frac{\partial z}{\partial x} = 0. \end{equation} This is worse than ambiguous --- it is wrong! For example, suppose that $f(x,y,z) = z$ and $g(x,y) = x$. Then it is clear that $ \frac{\partial w}{\partial z}\frac{\partial z}{\partial x} =1$, showing that \eqref{wrong} cannot be true.
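The counterexample $f(x,y,z)=z$, $g(x,y)=x$ makes the two conflicting meanings of $\frac{\partial w}{\partial x}$ concrete. This Python sketch computes both by finite differences (the base point is made up for illustration):

```python
# The counterexample from the text: f(x,y,z) = z and z = g(x,y) = x.
f = lambda x, y, z: z
g = lambda x, y: x

h = 1e-6
x0, y0 = 0.3, 0.8
z0 = g(x0, y0)

# Meaning 1: vary x with y AND z held fixed, i.e. the partial ∂f/∂x = ∂₁f.
df_dx = (f(x0 + h, y0, z0) - f(x0 - h, y0, z0)) / (2 * h)

# Meaning 2: vary x along the constraint z = g(x,y), i.e. ∂₁(f∘G)
# where G(x,y) = (x, y, g(x,y)).
dphi_dx = (f(x0 + h, y0, g(x0 + h, y0))
           - f(x0 - h, y0, g(x0 - h, y0))) / (2 * h)

assert abs(df_dx - 0.0) < 1e-6    # first meaning: 0
assert abs(dphi_dx - 1.0) < 1e-6  # second meaning: 1
```

The two values, $0$ and $1$, are genuinely different, which is exactly why the notation $\frac{\partial w}{\partial x}$ is dangerous here.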

The problem is that here we have written $\frac{\partial w}{\partial x}$ to mean two different things: on the left-hand side, it is $\partial_1 \phi(x,y)$, and on the right-hand side it is $\partial_1 f(x,y,g(x,y))$, using notation from \eqref{last}. But if we insist on using the notation \eqref{cr.trad}, then there is no very easy way of distinguishing between these two different things.

Summary of this discussion

You can never go wrong if you apply the chain rule correctly and carefully -- after all, it's a theorem. But bad choices of notation can lead to ambiguity or mistakes. You should be aware of this when you are carrying out chain rule computations or reading ones written by others.

On the other hand, shorter and more elegant formulas are often easier for the mind to absorb, and when there is no chance of confusion, this can be a reason to prefer them over complicated formulas that spell out every least nuance in mind-numbing detail.

Problems

Basic skills

Questions involving the chain rule are guaranteed to appear on quizzes, at least one Term Test, and on the Final Exam. (Such questions may also involve additional material that we have not yet studied, such as higher-order derivatives.) If the questions here do not give you enough practice, you can easily make up additional questions of a similar character. You can also find questions of this sort in Folland, Section 2.3 (with some solutions in the back of the book).

  1. Compute derivatives using the chain rule. For example,

  2. Use the chain rule to find relations between different partial derivatives of a function. For example,

  3. Find the tangent plane to the set $\ldots$ at the point $\bfa = \ldots$. For example:

More advanced questions

  1. (a) Let $q:\R^n\to \R$ be the (quadratic) function defined by $q(\bfx) = |\bfx|^2$. Determine $\nabla q$ (either by differentiating, or by remembering material from one of the tutorials.)
    (b) Assume that $\bfx:\R\to \R^3$ is a function that describes the trajectory of a particle. Thus $\bfx(t)$ is the particle's position at time $t$.
    If $\bfx$ is differentiable, then we will write ${\bf v}(t) = \bfx'(t)$, and we say that ${\bf v}(t)$ is the velocity vector at time $t$.
    Similarly, if $\bf v$ is differentiable, then we will write ${\bf a }(t)= {\bf v}'(t)$, and we say that ${\bf a}(t)$ is the acceleration vector at time $t$.
    We also say that $|{\bf v}(t)|$ is the speed at time $t$.
    Prove that $$ \mbox{ the speed is constant } \qquad \iff \qquad \bfa(t) \cdot {\bf v}(t) = 0\mbox{ for all }t. $$

    For the next three exercises, let $M^{n\times n}$ denote the space of $n\times n$ matrices. Let's write a typical element of $M^{n\times n}$ as a matrix $X$ with entries as shown: $$ X = \left( \begin{array}{ccc} x_{11} & \cdots & x_{1n}\\ \vdots & \ddots & \vdots\\ x_{n1} & \cdots & x_{nn} \end{array}\right) $$
    Now let's define a function $\det:M^{n\times n}\to \R$ by saying that $\det(X)$ is the determinant of the matrix. We can view this as a function of the variables $x_{11},\ldots, x_{nn}$.

  2. For $2\times 2$ matrices, compute $$ \frac{\partial}{\partial x_{11}} \det(X), \quad \frac{\partial}{\partial x_{12} }\det(X), \quad \frac{\partial}{\partial x_{21} }\det(X), \quad \frac{\partial}{\partial x_{22} }\det(X). $$ This is easy once you figure out what you have to do.

  3. Now consider $n\times n$ matrices for an arbitrary positive integer $n$. Let $I$ denote the $n\times n$ identity matrix. Thus, in terms of the variables $(x_{ij})$, $I$ corresponds to $$ x_{ij}=\begin{cases}1&\mbox{ if }i=j\\ 0&\mbox{ if }i\ne j \end{cases} $$ For every $i$ and $j$, compute $$ \frac{\partial}{\partial x_{ij}} \det(I). $$ This means: the derivative of the determinant function, evaluated at the identity matrix.
    (You will find that there are two cases: $i=j$ and $i\ne j$.)
    To see what you have to do, note for example that from the definition of partial derivatives, to compute $\frac{\partial}{\partial x_{11}}\det(I)$, you have to look at $$ \lim_{h\to 0} \frac 1 h \left[ \det \left( \begin{array}{ccccc} 1+h & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{array}\right) \ - \ \det \left( \begin{array}{ccccc} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{array}\right) \right] = ? $$

  4. Now suppose that $X(t)$ is a differentiable curve in the space of matrices, in other words, that $$ X(t) = \left( \begin{array}{ccc} x_{11}(t) & \cdots & x_{1n}(t) \\ \vdots & \ddots & \vdots\\ x_{n1}(t) & \cdots & x_{nn}(t) \end{array}\right) $$ where $x_{ij}(t)$ is a differentiable function of $t\in \R$, for all $i,j$. Also suppose that $X(0) = I$.
    Use the chain rule and the above exercise to find a formula for $$ \left. \frac d{dt} \det(X(t))\right|_{t=0} $$ in terms of $ x_{ij}'(0)$, for $i,j=1,\ldots, n$.

  5. Here we sketch a proof of the Chain Rule that may be a little simpler than the (optional) proof presented above. To simplify the set-up, let's assume that $\bfg:\R\to \R^n$ and $f:\R^n\to \R$ are both functions of class $C^1$. Define $\phi = f\circ \bfg$. Thus $\phi$ is a function $\R\to \R$. Your goal is to compute its derivative at a point $t\in \R$.
    To simplify still further, let's assume that $n=2$.
    Let's write $\bfg(t) = (x(t), y(t))$. Then \begin{align} \phi(t+h)-\phi(t) &= f(\bfg(t+h)) - f(\bfg(t)) \nonumber \\ &= f( x(t+h), y(t+h)) - f(x(t),y(t)) \nonumber \\ &= [f( x(t+h), y(t+h)) - f(x(t+h),y(t))]\nonumber \\ &\qquad \qquad\qquad+ [f(x(t+h),y(t)) -f(x(t),y(t))] . \nonumber \end{align} Starting from the above, mimic the proof of Theorem 3 in Section 2.1 to show that $$ \phi'(t) = \lim_{h\to 0}\frac 1 h(\phi(t+h)-\phi(t) ) \mbox{ exists and equals } \frac{\partial f}{\partial x} \frac{\partial x}{\partial t} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial t}, $$ where the partial derivatives of $f$ on the right-hand side are evaluated at $(x(t),y(t)) = \bfg(t)$.
