2.3 The Chain Rule

\(\renewcommand{\R}{\mathbb R }\)


  1. The Chain Rule
  2. Important special cases
  3. Some examples
  4. Proof of the Chain Rule (Optional)
  5. Differentiate, then substitute
  6. Problems


The Chain Rule

The chain rule from single variable calculus has a direct analogue in multivariable calculus, where the derivative of each function is replaced by its Jacobian matrix, and multiplication is replaced with matrix multiplication. As usual, we have generalized open intervals to open sets.

Suppose that \(S\) and \(T\) are open subsets of \(\R^n\) and \(\R^m\), and that we are given functions \(\mathbf g: S\to \R^m\) and \(\mathbf f:T\to \R^\ell\). Suppose also that \(\mathbf a\in S\) is a point such that \(\mathbf g(\mathbf a)\in T\); thus \(\mathbf f\circ \mathbf g(\mathbf x) = \mathbf f (\mathbf g(\mathbf x))\) is well-defined for all \(\mathbf x\) close to \(\mathbf a\).

If \(\mathbf g\) is differentiable at \(\mathbf a\) and \(\mathbf f\) is differentiable at \(\mathbf g(\mathbf a)\), then the composite function \(\mathbf f\circ \mathbf g\) is differentiable at \(\mathbf a\), and \[\begin{equation}\label{cr1} D(\mathbf f\circ\mathbf g)(\mathbf a) = [D\mathbf f(\mathbf g(\mathbf a))] \ [D\mathbf g(\mathbf a)]. \end{equation}\]

It may be helpful to write out \(\eqref{cr1}\) in terms of components and partial derivatives. We can write \(\mathbf f\) as a function of variables \(\mathbf y = (y_1,\ldots, y_m)\in \R^m\), and \(\mathbf g\) as a function of \(\mathbf x = (x_1,\ldots, x_n)\in \R^n\). Then for each \(k=1,\dots, \ell\) and \(j=1,\ldots, n\), \(\eqref{cr1}\) is the same as \[\begin{equation}\label{crcoord} \frac {\partial }{\partial x_j} (f_k\circ \mathbf g)(\mathbf a) =\sum_{i=1}^m \frac{\partial f_k}{\partial y_i}(\mathbf g(\mathbf a)) \ \frac{\partial g_i}{\partial x_j}(\mathbf a). \end{equation}\] This can be checked by writing out both sides of \(\eqref{cr1}\) — the left-hand side is the \((k,j)\) component of the matrix \(D(\mathbf f\circ \mathbf g)(\mathbf a)\), and the right-hand side is the \((k,j)\) component of the matrix product \([D\mathbf f(\mathbf g(\mathbf a))] \ [D\mathbf g(\mathbf a)]\).

The chain rule is also sometimes written in the following way: We write \({\bf u} = (u_1,\ldots, u_\ell)\) to denote a typical point in \(\R^\ell\). As above, we write \(\mathbf x = (x_1,\ldots, x_n)\) and \(\mathbf y = (y_1,\ldots, y_m)\) to denote typical points in \(\R^n\) and \(\R^m\). If we suppose that the \(\mathbf x, \mathbf y\) and \(\bf u\) variables are related by \[ \mathbf y = \mathbf g(\mathbf x), \qquad {\bf u} = \mathbf f(\mathbf y) = \mathbf f(\mathbf g(\mathbf x)), \] then it is traditional to write, for example, \(\dfrac{\partial u_k}{\partial x_j}\) to denote the infinitesimal change in the \(k\)th component of \(\mathbf u\) in response to an infinitesimal change in \(x_j\), that is, \(\frac{\partial u_k}{\partial x_j} = \frac{\partial } {\partial x_j}( f_k\circ \mathbf g)\). Using this notation, and with similar interpretations for \(\frac{\partial u_k}{\partial y_i}\) and \(\frac{\partial y_i}{\partial x_j}\), we can write the chain rule in the form \[\begin{equation}\label{cr.trad} \frac{\partial u_k}{\partial x_j} = \frac{\partial u_k}{\partial y_1}\frac{\partial y_1}{\partial x_j} + \cdots +\frac{\partial u_k}{\partial y_m}\frac{\partial y_m}{\partial x_j} \end{equation}\] for \(k=1,\dots, \ell\) and \(j=1,\ldots, n\). We emphasize that this is just a rewriting of the chain rule in suggestive notation, and its actual meaning is identical to that of \(\eqref{crcoord}\).
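To see the matrix multiplication in \(\eqref{cr1}\) at work in a small concrete case (the functions below are arbitrary choices, just for illustration), take \(n=m=2\), \(\ell=1\), \(\mathbf g(x_1,x_2) = (x_1^2+x_2, \ x_1x_2)\), and \(f(y_1,y_2) = y_1y_2\). On one hand, \(f\circ \mathbf g(x_1,x_2) = x_1^3x_2 + x_1x_2^2\), so differentiating directly gives \[ D(f\circ\mathbf g)(\mathbf x) = \left( 3x_1^2x_2+x_2^2 \quad\ x_1^3+2x_1x_2\right). \] On the other hand, \[ [Df(\mathbf g(\mathbf x))]\,[D\mathbf g(\mathbf x)] = \left( x_1x_2 \quad\ x_1^2+x_2 \right) \left(\begin{array}{cc} 2x_1 & 1 \\ x_2 & x_1 \end{array}\right) = \left( 3x_1^2x_2+x_2^2 \quad\ x_1^3+2x_1x_2\right), \] in agreement with \(\eqref{cr1}\).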

Important special cases

The case \(\ell=1\).

We most often apply the chain rule to compositions \(f\circ \mathbf g\), where \(f\) is a real-valued function. In this case, formula \(\eqref{cr1}\) simplifies to \[\begin{equation}\label{cr.scalar} D(f\circ\mathbf g)(\mathbf a) = [Df(\mathbf g(\mathbf a))] \ [D\mathbf g(\mathbf a)]. \end{equation}\] where \(Df\) is a \(1\times m\) matrix, that is, a row vector, and \(D(f\circ \mathbf g)\) is a \(1\times n\) matrix, also a row vector (but with length \(n\)). Also, the alternate notation \(\eqref{cr.trad}\) simplifies to \[ \frac{\partial u}{\partial x_j} = \frac{\partial u}{\partial y_1}\frac{\partial y_1}{\partial x_j} + \cdots +\frac{\partial u}{\partial y_m}\frac{\partial y_m}{\partial x_j}\ \quad \text{ for }j=1,\ldots, n. \]

The general form \(\eqref{cr1}\) of the chain rule says that for a vector function \(\mathbf f\), every component \(f_k\) satisfies \(\eqref{cr.scalar}\), for \(k=1,\ldots, \ell\).

The case \(\ell=n=1\).

Specializing still more, a case that arises often is \(\mathbf g:\R \to \R^m\) and \(f:\R^m\to \R\). Then \(f\circ \mathbf g\) is a function \(\R\to \R\), and the chain rule states that \[\begin{align}\label{crsc1} \frac{d}{dt} (f\circ \mathbf g)(t) &= \sum_{j=1}^m \frac{\partial f}{\partial x_j}(\mathbf g(t)) \frac{d g_j}{dt}(t) \ = \ \nabla f(\mathbf g(t)) \cdot \mathbf g'(t). \end{align}\] If we write this in the condensed notation used in \(\eqref{cr.trad}\) above, with variables \(u\) and \(\mathbf x = (x_1,\ldots, x_m)\) and \(t\) related by \(u = f(\mathbf x)\) and \(\mathbf x = \mathbf g(t)\), then we get \[ \frac{d u}{dt} = \frac {\partial u}{\partial x_1}\frac{d x_1}{dt} +\cdots + \frac {\partial u}{\partial x_m}\frac{d x_m}{dt}. \] This means the same as \(\eqref{crsc1}\), but you may find that it is easier to remember when written this way.
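As a quick illustration of \(\eqref{crsc1}\) (with functions chosen arbitrarily), take \(f(x_1,x_2) = x_1x_2\) and \(\mathbf g(t) = (t^2, e^t)\). Then \(\nabla f(\mathbf g(t)) = (e^t, t^2)\) and \(\mathbf g'(t) = (2t, e^t)\), so \(\eqref{crsc1}\) gives \[ \frac d{dt} f(\mathbf g(t)) = (e^t, t^2)\cdot (2t, e^t) = 2te^t + t^2e^t, \] which agrees with differentiating \(f(\mathbf g(t)) = t^2e^t\) directly.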

See Example 2 below for an illustration of this special case.

Some examples

Example 1: Polar coordinates. Suppose we have a function \(f:\R^2\to \R\), and we would like to know how it changes with respect to the distance from the origin or the angle around the origin; that is, what its derivatives are in polar coordinates. Let \[ S = \{ (r,\theta)\in \R^2 : r\geq0 \} \] and define \(\mathbf g:S\to \R^2\) by \(\mathbf g(r,\theta) =(r\cos \theta, r\sin \theta).\)

If \((x,y) = \mathbf g(r,\theta)\), then geometrically \(r\) is the distance between \((x,y)\) and the origin, and \(\theta\) is the angle between the positive \(x\)-axis and the line from the origin to \((x,y)\).


Now suppose that we are given a function \(f:\R^2\to \R\). Let’s write \(\phi\) to denote the composite function \(\phi = f\circ \mathbf g\), so \[ \phi(r,\theta) = f(r\cos\theta, r\sin\theta). \] Then \[ D\mathbf g = \left( \begin{array}{rr} \cos \theta & -r\sin\theta\\ \sin\theta & r\cos\theta\end{array} \right), \] so the Chain Rule says that \[ D\phi = \big( \partial_r \phi \ \ \ \partial_\theta \phi\big) = \big(\partial_x f \ \ \ \partial_y f\big) \left( \begin{array}{rr} \cos \theta & -r\sin\theta\\ \sin\theta & r\cos\theta\end{array} \right), \] where we have to remember that \(\partial_x f\) and \(\partial_y f\) are evaluated at \(\mathbf g(r,\theta)\). We can write this out in more detail as \[\begin{align*} \partial_r \phi = \partial_r (f\circ \mathbf g) &= \partial_x f(r\cos\theta,r\sin\theta) \cos \theta + \partial_y f(r\cos\theta,r\sin\theta) \sin \theta , \\ \partial_\theta \phi = \partial_\theta (f\circ \mathbf g) &= -\partial_x f(r\cos\theta,r\sin\theta)\, r\sin \theta + \partial_y f(r\cos\theta,r\sin\theta)\, r\cos \theta. \end{align*}\]

If we use the notation \(\eqref{cr.trad}\), writing \(u = f(x,y)\) with \((x,y) = (r\cos\theta, r\sin\theta)\), then the chain rule takes the form \[\begin{align*} \frac {\partial u} {\partial r} &= \frac{ \partial u}{\partial x} \frac{ \partial x}{\partial r} + \frac{ \partial u}{\partial y} \frac{ \partial y}{\partial r} \\ &= \frac{ \partial u}{\partial x}\ \cos \theta + \frac{ \partial u}{\partial y}\ \sin \theta \\ \frac {\partial u} {\partial \theta} &= \frac{ \partial u}{\partial x} \frac{ \partial x}{\partial \theta} + \frac{ \partial u}{\partial y} \frac{ \partial y}{\partial \theta} \\ &=- \frac{ \partial u}{\partial x} \ r \sin \theta + \frac{ \partial u}{\partial y} \ r\cos\theta. \end{align*}\]

The formulas for \(\partial_r\phi\) and \(\frac{\partial u}{\partial r}\) mean exactly the same thing; the differences are only choices of notation. One significant difference is that we have written, for example, \(\partial_x f(r\cos\theta, r\sin \theta)\) in the first and \(\frac{\partial u}{\partial x}\) in the second, and similarly for the \(y\) derivatives. In particular, in the first formula, we have explicitly indicated that derivatives of \(f\) should be evaluated at \(\mathbf g(r,\theta)\), whereas in the second, we have left it to the reader to figure this out.
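As a concrete check of these formulas, take \(f(x,y) = x^2+y^2\) (an arbitrary choice), so that \(\phi(r,\theta) = r^2\cos^2\theta + r^2\sin^2\theta = r^2\). Here \(\partial_x f = 2x\) and \(\partial_y f = 2y\), so evaluating at \((x,y) = (r\cos\theta, r\sin\theta)\), \[ \partial_r \phi = 2r\cos\theta\,\cos\theta + 2r\sin\theta\,\sin\theta = 2r, \qquad \partial_\theta \phi = -2r\cos\theta\, r\sin\theta + 2r\sin\theta\, r\cos\theta = 0, \] exactly as we find by differentiating \(\phi(r,\theta) = r^2\) directly.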

Example 2: Motion of a particle. Let \(\mathbf g:\R\to \R^n\) be a differentiable function, and consider \[ \phi(t) = |\mathbf g(t)| = f(\mathbf g(t))\qquad\text{ for }\quad f(\mathbf x) = |\mathbf x|. \] If we think of \(\mathbf g(t)\) as being the position at time \(t\) of a particle that is moving around in \(\R^n\), then \(\phi(t)\) is the particle’s distance from the origin at time \(t\).

Exercise: If you have not already done it, check that \(f\) is differentiable everywhere except at the origin, and that \[ \nabla f(\mathbf x) = \frac{\mathbf x}{|\mathbf x|}\qquad\text{ for }\mathbf x\ne {\bf 0}. \]

Geometric meaning of the formula for \(\nabla f\). Recall what we know about \(\nabla f(\mathbf x)\): it is a vector that points in the direction where \(f\) has the greatest increase near \(\mathbf x\), and its magnitude is the rate of increase in that direction. At a point \(\mathbf x\), if we want to increase the distance from the origin as quickly as possible, we should move directly away from the origin, that is, in the direction of \(\mathbf x\). So \(\nabla f(\mathbf x)\) should have the form \(\lambda \mathbf x\) for some \(\lambda > 0\). And if we move in this direction, our distance from the origin will increase at exactly a rate of 1 unit of distance per unit of distance travelled. So the rate of increase is \(1\), and thus \(|\nabla f(\mathbf x)|\) should equal \(1\). Putting these together, we conclude that \(\nabla f(\mathbf x)\) should equal \(\dfrac {\mathbf x}{|\mathbf x|}\). Of course, this is not a proof, but it may be helpful for our understanding.
Then using the chain rule \(\eqref{crsc1}\) and properties of the dot product, we find that \[ \frac d{dt} |\mathbf g(t)| = \frac{\mathbf g(t)}{|\mathbf g(t)|}\cdot \mathbf g'(t) = |\mathbf g'(t)| \cos \theta \] where \(\theta\) is the angle between \(\frac{\mathbf g(t)}{|\mathbf g(t)|}\) and \(\mathbf g'(t)\). This says that the rate at which distance from \({\bf 0 }\) is changing equals the speed \(|\mathbf g'(t)|\) at which the particle is moving multiplied by the cosine of the angle between the direction of motion and the direction pointing exactly away from the origin.
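For instance (a made-up example, just for illustration), if the particle moves on a circle of radius \(R\) at constant angular speed, say \(\mathbf g(t) = (R\cos\omega t, R\sin\omega t)\), then \(\mathbf g'(t) = (-R\omega\sin\omega t, R\omega\cos\omega t)\) is orthogonal to \(\mathbf g(t)\). So \(\cos\theta = 0\) in the formula above, and \(\frac d{dt}|\mathbf g(t)| = 0\), consistent with the fact that the distance from the origin is constantly equal to \(R\).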

Example 3: Homogeneous functions.

A function \(f:\R^n\to \R\) is said to be homogeneous of degree \(\alpha\) if \[ f(\lambda \mathbf x) = \lambda^\alpha f(\mathbf x)\quad\text{ for all }\mathbf x\ne{\bf 0}\text{ and }\lambda>0. \] These were introduced in one of the problems in Section 2.1. For example, the monomial \(x^ay^bz^c\) is homogeneous of degree \(\alpha = a+b+c\). The definition also applies if the domain of \(f\) does not include the origin, and for the present discussion, it does not matter whether or not \(f({\bf 0})\) is defined.

An interesting fact about homogeneous functions can be proved using the chain rule: if \(f\) is differentiable and homogeneous of degree \(\alpha\), then \(\nabla f(\mathbf x) \cdot \mathbf x = \alpha f(\mathbf x)\) for every \(\mathbf x \neq \mathbf 0\).
To see this, let \(f:\R^n\to \R\) be differentiable and homogeneous of degree \(\alpha\). Fix a nonzero vector \(\mathbf x\in \R^n\), and define functions \(\mathbf g:(0,\infty)\to \R^n\) and \(h:(0,\infty)\to \R\) by \[ \mathbf g(\lambda) = \lambda \mathbf x, \qquad\qquad h(\lambda) = f(\mathbf g(\lambda)) = f(\lambda \mathbf x). \] We know that \(h(\lambda) = \lambda^\alpha f(\mathbf x)\), so \[ h'(\lambda) = \alpha \lambda^{\alpha-1} f(\mathbf x). \] On the other hand, we can compute \(h' = (f\circ \mathbf g)'\) using the chain rule. This leads to \[ h'(\lambda) = \nabla f(\lambda \mathbf x) \cdot \mathbf x \] since \(\mathbf g'(\lambda)=\mathbf x\). Setting \(\lambda = 1\) and equating the two expressions for \(h'(1)\), we find that \[ \nabla f(\mathbf x) \cdot \mathbf x = \alpha f(\mathbf x). \] Since \(\mathbf x\) was an arbitrary nonzero element of \(\R^n\), this holds at every such point.
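For example, the monomial \(f(x,y) = x^2y\) is homogeneous of degree \(3\), and indeed \[ \nabla f(x,y)\cdot (x,y) = (2xy, \, x^2)\cdot (x,y) = 2x^2y + x^2y = 3f(x,y), \] in agreement with the identity above.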

Example 4: Level sets and the gradient

Next we use the chain rule to prove a basic property of the gradient.
Suppose that \(S\) is an open subset of \(\R^n\) and that \(f:S\to \R\) is differentiable at \(\mathbf a\). Then \(\nabla f(\mathbf a)\) is orthogonal to the level set of \(f\) that passes through \(\mathbf a\).

We will first explain more precisely what this means. Let \[\begin{equation}\label{lsnot} c = f(\mathbf a), \qquad \text{ and } \quad C = \{ \mathbf x \in S : f(\mathbf x) = c\}. \end{equation}\] Thus \(C\) is the level set of \(f\) that passes through \(\mathbf a\). The theorem says that \[\begin{equation}\label{lsg2} \nabla f(\mathbf a)\cdot {\bf v} = 0\qquad\text{ for every vector $\bf v$ that is tangent to $C$ at $\mathbf a$.} \end{equation}\]

Note, if \(\nabla f(\mathbf a)= {\bf 0}\), then \(\nabla f(\mathbf a)\cdot {\bf v}=0\) for every \(\bf v\), and \(\eqref{lsg2}\) is trivial. So we will assume that \(\nabla f(\mathbf a)\ne {\bf 0}.\) This assumption has some interesting and relevant geometric consequences that we will discuss in detail later.

To prove that \(\eqref{lsg2}\) is true, we need to say what we mean by “every vector \(\bf v\) that is tangent to \(C\) at \(\mathbf a\).” To do this, consider any open interval \(I\subseteq \R\) such that \(0\in I\), and let \(\gamma:I \to \R^n\) be a parameterized curve such that \[\begin{equation} \gamma(t)\in C\quad \text{ for all }t\in I, \qquad \gamma(0) = \mathbf a,\qquad \text{ and } \gamma'(0) \text{ exists}. \label{gamma1}\end{equation}\] We now define:
The vector \(\mathbf v\) is tangent to \(C\) at \(\mathbf a\) if \(\mathbf v = \gamma'(0)\) for some curve \(\gamma\) satisfying \(\eqref{gamma1}\).

So to prove \(\eqref{lsg2}\), we must show that if \(\gamma\) is any differentiable curve satisfying \(\eqref{gamma1}\), then \[\begin{equation} \gamma'(0)\cdot \nabla f(\mathbf a) = 0. \label{lsg3}\end{equation}\]

Details for \(\eqref{lsg3}\)

Fix an open interval \(I\subseteq \R\) containing \(0\) and a curve \(\gamma\) satisfying \(\eqref{gamma1}\). Write \(h(t) = f\circ \gamma(t)\). By the definition of the level set \(C\), the assumption that \(\gamma(t)\in C\) for all \(t\in I\) means that \(h(t) = f(\gamma(t))=c\) for all \(t\in I\).

Thus \[ h'(t)= 0\text{ for all }t, \qquad\text{ and in particular, }\quad h'(0)=0. \] On the other hand, we can compute \(h'(0)\) using the chain rule, to find that \[ 0 = h'(0) = (f\circ \gamma)'(0) = \nabla f(\gamma(0)) \cdot \gamma'(0) = \nabla f(\mathbf a)\cdot \gamma'(0). \] This proves \(\eqref{lsg3}\).
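For a concrete illustration, take \(f(x,y,z) = x^2+y^2+z^2\) and \(\mathbf a = (1,0,0)\), so that \(C\) is the unit sphere. The curve \(\gamma(t) = (\cos t, \sin t, 0)\) satisfies \(\eqref{gamma1}\), with \(\gamma'(0) = (0,1,0)\), and \[ \nabla f(\mathbf a)\cdot \gamma'(0) = (2,0,0)\cdot(0,1,0) = 0, \] as the theorem predicts: the tangent vector \(\gamma'(0)\) is orthogonal to \(\nabla f(\mathbf a)\), which points radially outward from the sphere.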

In a way this discussion is incomplete. It would be better if we could say that \[\begin{equation}\label{tv2} {\mathbf v}\text{ is tangent to } C \text{ at } \mathbf a \qquad \iff \qquad \nabla f(\mathbf a)\cdot {\bf v} = 0. \end{equation}\] So far we have only proved that the implication \(\Longrightarrow\) holds. Although we have not proved it, in fact \(\eqref{tv2}\) is true. We will return to this point later.

Tangent plane to a level set

Example 4 motivates a definition that will be useful for discussing the geometry of the derivative.

Suppose \(S\) is an open subset of \(\R^3\) and that \(f:S\to \R\) is a function that is differentiable at a point \(\mathbf a\in S\). Let \(C\) denote the level set of \(f\) that passes through \(\mathbf a\). Assume also that \(\nabla f(\mathbf a)\ne 0.\) The tangent plane to \(C\) at \(\mathbf a\) is
\[\begin{equation}\label{tp.def} \{ \mathbf x \in \R^3 : (\mathbf x - \mathbf a)\cdot \nabla f(\mathbf a) = 0 \}. \end{equation}\]

Using \(\eqref{tv2}\), this definition states that a point \(\mathbf x\) belongs to the tangent plane to \(C\) at \(\mathbf a\) if and only if \(\mathbf x\) has the form \(\mathbf x = \mathbf a+{\bf v}\), where \(\bf v\) is tangent to the level set at \(\mathbf a\).

Example 5.

Find the tangent plane to the surface \[ C = \{ (x,y,z)\in \R^3 : x^2 - 2xy +4yz - z^2 = 2\} \] at the point \(\mathbf a =(1,1,1)\).

Solution. The surface is the level set through \(\mathbf a\) of \(f(x,y,z) = x^2 - 2xy + 4yz - z^2\) (note that \(\mathbf a\in C\), since \(1 - 2 + 4 - 1 = 2\)). First, we compute \(\nabla f(x,y,z) = (2x-2y, -2x+4z, 4y-2z)\), so \(\nabla f(\mathbf a)=(0,2,2)\). Then we apply the definition of the tangent plane to \(C\) at \(\mathbf a\): \[\{ (x,y,z)\in \R^3 : (x-1, y-1, z-1)\cdot (0,2, 2) = 0\}= \{ (x,y,z)\in \R^3 : y+z = 2 \}. \]

Note, if we had said “find the equation for the tangent plane” instead of “find the tangent plane”, then the answer would have been “\(y+z=2\)”.

Proof of the Chain Rule (optional)

Here we sketch the proof of the chain rule.

Recall the assumptions:

  • \(S\) and \(T\) are open subsets of \(\R^n\) and \(\R^m\) respectively,
  • \(\mathbf g:S\to \R^m\) is differentiable at \(\mathbf a \in S\).
  • \(\mathbf f:T\to \R^\ell\) is differentiable at \(\mathbf b = \mathbf g(\mathbf a)\in T\).

The definition of differentiability involves error terms which we typically write as \(\mathbf E({\bf h})\). In this proof we have to keep track of several different error terms, so we will use subscripts to distinguish between them. For example, we will write \(\mathbf E_{\mathbf g, \mathbf a}( {\bf h})\) to denote the error term for \(\mathbf g\) near the point \(\mathbf a\).

It turns out to make the notation easier to write \(M\) instead of \(D\mathbf g(\mathbf a)\) and \(N\) instead of \(D\mathbf f(\mathbf g(\mathbf a)) = D\mathbf f(\mathbf b)\). Thus, \(M\) is the (unique) \(m\times n\) matrix such that \[\begin{equation}\label{dga} \mathbf g(\mathbf a +{\bf h}) = \mathbf g(\mathbf a ) + M \mathbf h + \mathbf E_{\mathbf g, \mathbf a}({\bf h})\qquad\text{ where } \lim_{\mathbf h \to \mathbf 0}\frac {\mathbf E_{\mathbf g, \mathbf a}({\bf h})}{|\bf h|} = \mathbf 0, \end{equation}\] and similarly, \(N\) is characterized by \[\begin{equation}\label{dfb} \mathbf f(\mathbf b +{\bf k}) = \mathbf f(\mathbf b ) + N {\bf k} + \mathbf E_{\mathbf f, \mathbf b}({\bf k})\qquad\text{ where } \lim_{\bf k \to \bf0}\frac {\mathbf E_{\mathbf f, \mathbf b}({\bf k})}{|\bf k|} = \mathbf 0. \end{equation}\] Using \(\eqref{dga}\), we find that \[\begin{align} \mathbf f(\mathbf g(\mathbf a +{\bf h})) &= \mathbf f\Big( \overbrace{\mathbf g(\mathbf a) }^{\mathbf b}+ \overbrace{M \, {\bf h} + \mathbf E_{\mathbf g, \mathbf a}({\bf h}) }^{\bf k}\Big) \\ &\overset{\eqref{dfb}}= \mathbf f(\overbrace {\mathbf g(\mathbf a)}^\mathbf b) + N( \overbrace{M \, {\bf h} + \mathbf E_{\mathbf g, \mathbf a}({\bf h}) }^{\bf k}) + \mathbf E_{\mathbf f, \mathbf b}({\bf k})\\ &= \mathbf f(\mathbf g(\mathbf a)) + NM{\bf h} \ + \ N \mathbf E_{\mathbf g, \mathbf a}({\bf h}) +\mathbf E_{\mathbf f, \mathbf b}({\bf k}) \end{align}\] We can rewrite this as \[\begin{multline} \mathbf f\circ \mathbf g(\mathbf a+{\mathbf h}) = \mathbf f\circ \mathbf g(\mathbf a) + N M\, {\bf h} + \mathbf E_{\mathbf f\circ \mathbf g, \mathbf a}(\mathbf h),\\ \qquad\text{ where } \mathbf E_{\mathbf f\circ \mathbf g, \mathbf a}(\mathbf h) : = N\mathbf E_{\mathbf g, \mathbf a}({\bf h}) + \mathbf E_{\mathbf f, \mathbf b}({\bf k}), \qquad {\bf k} = M{\bf h}+ \mathbf E_{\mathbf g, \mathbf a}(\bf h). \nonumber \end{multline}\] Since \(N M = D\mathbf f(\mathbf g(\mathbf a)) D\mathbf g(a)\), this will imply the chain rule, after we verify that \[\begin{equation}\label{cr.proof} \lim_{\bf h\to \bf0} \frac 1{|\bf h|} \mathbf E_{\mathbf f\circ \mathbf g, \mathbf a}(\mathbf h) = \bf0. \end{equation}\]

Details for \(\eqref{cr.proof}\). This is even more optional than the rest of the proof, but if you are interested, the verification follows.

We consider separate pieces of the error term one after another. We will be terse.

First, since \(N\) is a fixed matrix, there exists a number \(C\) such that \(|N {\bf v}| \le C |\bf v|\) for all \(\bf v\in \R^m\). Since \(\mathbf E_{\mathbf g, \mathbf a}({\bf h})\) is a vector in \(\R^m\), it follows that \[\begin{equation}\label{cr.p1} \frac 1{|\bf h|} |N\mathbf E_{\mathbf g, \mathbf a}({\bf h})| \le \frac C{|\bf h|} |\mathbf E_{\mathbf g, \mathbf a}({\bf h})| \to 0 \text{ as }{\bf h}\to \bf0. \end{equation}\] Also, \[ \frac 1{|\bf h|} |\mathbf E_{\mathbf f, \mathbf b}({\bf k})| = \frac {|\bf k|} {|\bf h|} \frac 1{|\bf k|} |\mathbf E_{\mathbf f, \mathbf b}({\bf k})| . \] Since \(M\) is a fixed matrix and \(\lim_{\bf h\to \bf0} \frac{\mathbf E_{\mathbf g, \mathbf a}(\mathbf h)}{|\bf h|} = \bf0\), one can check that there exists some constant \(D\) such that \[ |{\bf k}| \le D |{\bf h}| \qquad \text{ whenever } |{\bf h}| \text{ is sufficiently small}. \] It follows that \({\bf k}\to \bf0\) as \(\bf h \to \bf0\), and hence that \[\begin{equation}\label{cr.p2} \lim_{\mathbf h \to {\bf 0}}\frac 1{|\bf h|} |\mathbf E_{\mathbf f, \mathbf b}({\bf k})| \le \lim_{\bf k \to {\bf 0}}D \frac 1{|\bf k|} |\mathbf E_{\mathbf f, \mathbf b}({\bf k})| = 0. \end{equation}\] Finally, we deduce \(\eqref{cr.proof}\) by combining \(\eqref{cr.p1}\) and \(\eqref{cr.p2}\) with the triangle inequality.

Differentiate, then substitute

With the chain rule, it is common to get tripped up by ambiguous notation. For example, suppose we are given \(f:\R^3\to \R\), which we will write as a function of variables \((x,y,z)\). Further assume that \(\mathbf G:\R^2\to \R^3\) is a function of variables \((u,v)\), of the form \[ \mathbf G(u,v) = (u, v, g(u,v)) \qquad\text{ for some }g:\R^2\to \R. \] Let’s write \(\phi = f\circ \mathbf G\). Then a routine application of the chain rule tells us that \[ ( \partial_u \phi \ \ \ \partial_v \phi ) \ = \ \left(\begin{array}{ccc} \frac{\partial f}{\partial x}\circ \mathbf G & \ \frac{\partial f}{\partial y}\circ \mathbf G &\ \frac{\partial f}{\partial z}\circ \mathbf G \end{array}\right) \left(\begin{array}{cc} 1&0\\ 0&1 \\ \frac{\partial g}{\partial u}& \frac{\partial g}{\partial v} \end{array}\right) \] For simplicity, considering only the \(u\) derivative, this says that

\[ \frac{\partial \phi}{\partial u}(u,v) = \frac{\partial f}{\partial x} (u,v,g(u,v)) + \frac{\partial f}{\partial z}(u,v,g(u,v)) \frac {\partial g}{\partial u}(u,v). \] This is perfectly correct but a little complicated. After all, since \(x=u\) and \(y=v\), it might be simpler to write \(\mathbf G\) as a function of \(x\) and \(y\) rather than \(u\) and \(v\), ie \(\mathbf G(x,y) = (x,y,g(x,y))\). Then we would write \[ \phi(x,y) = f(\mathbf G(x,y)) = f(x,y,g(x,y)), \] and by changing \((u,v)\) to \((x,y)\), our formula for the derivative becomes \[\begin{equation} \frac{\partial \phi}{\partial x}(x,y) = \frac{\partial f}{\partial x} (x,y,g(x,y)) + \frac{\partial f}{\partial z}(x,y,g(x,y)) \frac {\partial g}{\partial x}(x,y). \label{wo05}\end{equation}\] However, this is a little ambiguous, since if someone sees the expression \[\begin{equation} \frac{\partial f}{\partial x} (x,y,g(x,y)) \label{wo1}\end{equation}\] they can be legitimately confused about whether it means

  • first compute the partial derivative with respect to \(x\), then substitute \(z=g(x,y)\), OR

  • first substitute \(z=g(x,y)\), then compute the partial derivative with respect to \(x\).
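To see that the two readings genuinely disagree, take \(f(x,y,z) = xz\) and \(g(x,y) = x+y\) (a made-up example, just for illustration). With the first reading, \(\frac{\partial f}{\partial x} = z\), and substituting \(z = g(x,y) = x+y\) gives \(x+y\). With the second reading, \(f(x,y,g(x,y)) = x(x+y)\), whose partial derivative with respect to \(x\) is \(2x+y\).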

The ambiguity could be resolved by using more parentheses to indicate the order of operations. It would be clear if we write

  • \(\left(\dfrac{\partial f}{\partial x}\right) (x,y,g(x,y))\) to mean “first differentiate then substitute,” and
  • \(\dfrac{\partial }{\partial x} \left(f (x,y,g(x,y))\right)\) to mean “first substitute then differentiate.”

Writers almost never do this, possibly because expressions with so many parentheses are hard to parse quickly and look clunky. Omitting the parentheses is similar to writing \(5+2\cdot 3\) and hoping that our reader does not interpret this to mean \((5+2)\cdot3\): we need to establish a convention, and in this case the first interpretation (differentiate, then substitute) is the conventional one. The second interpretation is exactly what we called \(\frac{\partial \phi}{\partial x}\). Expressions like \(\eqref{wo1}\) can be confusing, and \(\eqref{wo05}\) is only correct if the reader is able to figure out exactly what it means.

One way to avoid this problem is to write the derivatives of \(f\) as \(\partial_1 f\), \(\partial_2 f\), etc., instead of \(\frac{\partial f}{\partial x}\) or \(\partial_x f\). Then \(\eqref{wo05}\) looks like \[\begin{equation}\label{last} \partial_1\phi(x,y) = \partial_1 f(x,y,g(x,y)) + \partial_3 f(x,y,g(x,y)) \partial_1g(x,y), \end{equation}\] and this is correct and unambiguous, though still a little awkward. It can be written more simply as \[ \partial_1\phi= \partial_1 f + \partial_3 f\partial_1g, \] as long as we trust our readers to figure out that derivatives of \(g\) are evaluated at \((x,y)\) and derivatives of \(f\) at \((x,y,g(x,y))\).
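Returning to the illustrative choice \(f(x,y,z) = xz\), \(g(x,y) = x+y\) used above, formula \(\eqref{last}\) gives \[ \partial_1\phi(x,y) = \partial_1 f(x,y,x+y) + \partial_3 f(x,y,x+y)\,\partial_1 g(x,y) = (x+y) + x\cdot 1 = 2x+y, \] which matches the answer obtained by substituting first, as it should, since \(\phi(x,y) = x(x+y)\).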

One can also get into more serious trouble, for example as follows. Let’s continue to write \(g\) as a function of \((x,y)\) rather than \((u,v)\), and let’s also write \(w\) as the name of the variable that is the output of the function \(f\), that is, \(w = f(x,y,z)\). Then we can write \[ w = f(x,y,z) \qquad \text{ where } \qquad z = g(x,y). \] Suppose we want to know about rates of change in \(w\) in response to infinitesimal or small changes in \(x\), always restricting our attention to the set of points where \(z=g(x,y)\). Since \(z\) depends on \(x\), we have to use the chain rule. If we use the notation \(\eqref{cr.trad}\), we might write \[ \frac{\partial w}{\partial x} = \frac{\partial w}{\partial x}\frac{\partial x}{\partial x}+ \frac{\partial w}{\partial y}\frac{\partial y}{\partial x}+ \frac{\partial w}{\partial z}\frac{\partial z}{\partial x}. \] But it is clear that \(\frac{\partial x}{\partial x}= 1\), and that \(\frac{\partial y}{\partial x} = 0\), because \(x\) and \(y\) are unrelated. Thus the above equation reduces to \[\begin{equation} \frac{\partial w}{\partial x} = \frac{\partial w}{\partial x} + \frac{\partial w}{\partial z}\frac{\partial z}{\partial x}. \label{cr.example}\end{equation}\] It follows that \[\begin{equation}\label{wrong} \frac{\partial w}{\partial z}\frac{\partial z}{\partial x} = 0. \end{equation}\] This is worse than ambiguous — it is wrong! For example, suppose that \(f(x,y,z) = z\) and \(g(x,y) = x\). Then it is clear that \(\frac{\partial w}{\partial z}\frac{\partial z}{\partial x} =1\), showing that \(\eqref{wrong}\) cannot be true.

The problem is that here we have written \(\frac{\partial w}{\partial x}\) to mean two different things: on the left-hand side, it is \(\partial_1 \phi(x,y)\), and on the right-hand side it is \(\partial_1 f(x,y,g(x,y))\), using notation from \(\eqref{last}\). But if we insist on using the notation \(\eqref{cr.trad}\), then there is no simple way of distinguishing between these two different things.

Summary of this discussion

You can never go wrong if you apply the chain rule correctly and carefully — after all, it’s a theorem. But bad choices of notation can lead to ambiguity or mistakes. You should be aware of this when you are

  • using the chain rule,
  • explaining some application of the chain rule to someone (eg, writing up the solution of a problem), or
  • reading discussions that use the chain rule, particularly if they use notation like \(\eqref{cr.trad}\).

On the other hand, shorter and more elegant formulas are often easier for the mind to absorb. When there is no chance of confusion, this can be a reason to prefer them over complicated formulas that spell out every nuance in mind-numbing detail.

Problems

Basic

Questions involving the chain rule will appear on homework, on at least one Term Test, and on the Final Exam. Such questions may also involve additional material that we have not yet studied, such as higher-order derivatives. You will also see the chain rule in MAT 244 (Ordinary Differential Equations) and APM 346 (Partial Differential Equations). If the questions here do not give you enough practice, you can easily make up additional questions of a similar character. You can also find questions of this sort in any book on multivariable calculus.

  1. Compute derivatives using the chain rule. For example,
  • Suppose that \(f:\R^3\to \R\) is of class \(C^1\), and consider the function \(\phi:\R^2\to \R\) defined by \[ \phi(x,y) = f(x^2-y, xy, x\cos y) \] Express \(\partial_x\phi\) and \(\partial_y \phi\) in terms of \(x,y\) and partial derivatives of \(f\).

  • Suppose that \(f:\R^2\to \R\) is of class \(C^1\), and consider the function \(\phi:\R^3\to \R\) defined by \[ \phi(x,y,z) = f(x^2-yz, xy+\cos z) \]Express partial derivatives of \(\phi\) with respect to \(x,y,z\) in terms of \(x,y,z\) and partial derivatives of \(f\).

  • Suppose that \(f:\R^2\to \R\) is of class \(C^1\). Let \(S = \{(r,s)\in \R^2 : s\ne 0\}\), and for \((r,s)\in S\), define \(\phi(r,s) = f(rs, r/s)\). Find formulas for \(\partial_r\phi\) and \(\partial_s\phi\) in terms of \(r,s\) and derivatives of \(f\).

  • Suppose that \(f:\R^2\to \R\) is of class \(C^1\). Let \(S = \{(x,y,z)\in \R^3 : z\ne 0\}\), and for \((x,y,z)\in S\), define \(\phi(x,y,z) = f(xy, y/z)\). Find formulas for partial derivatives of \(\phi\) in terms of \(x,y,z\) and partial derivatives of \(f\).

  2. Use the chain rule to find relations between different partial derivatives of a function. For example:

  • Suppose that \(f:\R\to \R\) is of class \(C^1\), and that \(u = f(x^2+y^2+z^2)\). Prove that \[ x\partial_y u - y \partial_x u = 0. \]
  • Suppose that \(f:\R^2\to \R\) is of class \(C^1\), and that \(u = f(x^2+y^2+z^2, y+ z)\). Prove that \[ (y-z)\partial_x u - x \partial_y u + x \partial_z u = 0. \]

  3. Find the tangent plane to the set \(\ldots\) at the point \(\mathbf a = \ldots\). For example:

    • Find the tangent plane to \(\{ (x,y,z)\in \R^3 : x^2e^{y/(z^2+1)} = 4\}\) at the point \(\mathbf a = (2,0,1)\)

Advanced

  1. Let \(q:\R^n\to \R\) be the (quadratic) function defined by \(q(\mathbf x) = |\mathbf x|^2\). Determine \(\nabla q\) (either by differentiating, or by remembering material from one of the tutorials.)
  2. Suppose that \(\mathbf x:\R\to \R^3\) is a function that describes the trajectory of a particle. Thus \(\mathbf x(t)\) is the particle’s position at time \(t\).
  • If \(\mathbf x\) is differentiable, then we will write \({\bf v}(t) = \mathbf x'(t)\), and we say that \({\bf v}(t)\) is the velocity vector at time \(t\).
  • Similarly, if \(\bf v\) is differentiable, then we will write \({\bf a}(t)= {\bf v}'(t)\), and we say that \({\bf a}(t)\) is the acceleration vector at time \(t\).
  • We also say that \(|{\bf v}(t)|\) is the speed at time \(t\).

Prove that the speed is constant if and only if \({\bf a}(t)\cdot {\bf v}(t) = 0\) for all \(t\).

For the next three exercises, let \(M^{n\times n}\) denote the space of \(n\times n\) matrices. We write a typical element of \(M^{n\times n}\) as a matrix \(X\) with entries: \[ X = \left( \begin{array}{ccc} x_{11} & \cdots & x_{1n}\\ \vdots & \ddots & \vdots\\ x_{n1} & \cdots & x_{nn} \end{array}\right) \] Define the function \(\det:M^{n\times n}\to \R\) by saying that \(\det(X)\) is the determinant of the matrix. We can view this as a function of the variables \(x_{11},\ldots, x_{nn}\).

  3. For \(2\times 2\) matrices, compute \[ \frac{\partial}{\partial x_{11}} \det(X), \quad \frac{\partial}{\partial x_{12} }\det(X), \quad \frac{\partial}{\partial x_{21} }\det(X), \quad \frac{\partial}{\partial x_{22} }\det(X). \]

  4. Now consider \(n\times n\) matrices for an arbitrary positive integer \(n\). Let \(I\) denote the \(n\times n\) identity matrix. Thus, in terms of the variables \((x_{ij})\), \(I\) corresponds to \[ x_{ij}=\begin{cases}1&\text{ if }i=j\\ 0&\text{ if }i\ne j \end{cases} \] For every \(i\) and \(j\), compute \[ \frac{\partial}{\partial x_{ij}} \det(I). \] This means: the derivative of the determinant function, evaluated at the identity matrix.

    Hint. There are two cases: \(i=j\) and \(i\ne j\).

From the definition of partial derivatives, to compute \(\frac{\partial}{\partial x_{11}}\det(I)\), you have to look at \[ \lim_{h\to 0} \frac 1 h \left[ \det \left( \begin{array}{ccccc} 1+h & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{array}\right) \ - \ \det I\right] \]
  5. Now suppose that \(X(t)\) is a “differentiable curve in the space of matrices”, in other words, that \[ X(t) = \left( \begin{array}{ccc} x_{11}(t) & \cdots & x_{1n}(t) \\ \vdots & \ddots & \vdots\\ x_{n1}(t) & \cdots & x_{nn}(t) \end{array}\right) \] where \(x_{ij}(t)\) is a differentiable function of \(t\in \R\), for all \(i,j\). Also suppose that \(X(0) = I\).

Use the chain rule and the above exercise to find a formula for \(\left. \frac d{dt} \det(X(t))\right|_{t=0}\) in terms of \(x_{ij}'(0)\), for \(i,j=1,\ldots, n\).

  6. Here we sketch a proof of the Chain Rule that may be a little simpler than the proof presented above. To simplify the set-up, let’s assume that \(\mathbf g:\R\to \R^n\) and \(f:\R^n\to \R\) are both functions of class \(C^1\). Define \(\phi = f\circ \mathbf g\). Thus \(\phi\) is a function \(\R\to \R\). Your goal is to compute its derivative at a point \(t\in \R\). To simplify still further, let’s assume that \(n=2\). Let’s write \(\mathbf g(t) = (x(t), y(t))\). Then \[\begin{align*} \phi(t+h)-\phi(t) &= f(\mathbf g(t+h)) - f(\mathbf g(t)) \\ &= f( x(t+h), y(t+h)) - f(x(t),y(t)) \\ &= [f( x(t+h), y(t+h)) - f(x(t+h),y(t))] \\ &\qquad \qquad\qquad+ [f(x(t+h),y(t)) -f(x(t),y(t))] . \\ \end{align*}\] Starting from the above, mimic the proof of Theorem 3 in Section 2.1 to show that \[ \phi'(t) = \lim_{h\to 0}\frac 1 h(\phi(t+h)-\phi(t) ) \text{ exists and equals } \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}, \] where the partial derivatives of \(f\) on the right-hand side are evaluated at \((x(t),y(t)) = \mathbf g(t)\).


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Canada License.