Differential forms, reverse derivatives and machine learning

I was recently trying to convince Bruno that covectors (in the form of differentials of functions) are the real deal in gradient descent, despite the misleading name and countless pictures showing little arrows pointing downhill.

The reasons are two:

  1. Derivatives are really differential forms, not vector fields (there is huge pedagogical confusion on this point, since derivatives are sold everywhere as 'tangent vectors at a point').
  2. The reverse derivative (the star of backpropagation) is actually the pullback of differential forms in disguise.

I believe the culprit of the confusion is the constant abuse of Euclidean spaces and the conflation of their tangent vectors with their points, or, even worse, the habit of identifying tangent fibers at different points. Euclidean spaces are cool but very special as manifolds. Therefore, if you want to know the full story of differential geometry, you really shouldn't focus on them as an example.

Add to this the ubiquity of Riemannian and symplectic structures, which allow one to freely identify tangent and cotangent spaces by raising and lowering indices, and you get a recipe for disaster.

Derivatives are differential forms

Let's look at the first idea: derivatives are differential forms. This is almost tautological in the way things are defined: the cotangent bundle is defined exactly as the bundle of differentials over a manifold, so of course the differential of a function is a section of it.

On the other hand, one may understand it in the following way: a smooth function is also a map of smooth manifolds, say \( f:M \to \mathbb R \), therefore it induces a map between its tangent bundles \( Tf : TM \to T\mathbb R \). But then each fiber of \( T\mathbb R \) can be canonically identified with a single copy of \( \mathbb R \), so that \( Tf \) is a fiberwise linear functional, i.e. a differential form.
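To make the 'fiberwise linear functional' reading tangible, here's a small numerical sketch in NumPy (the function \( f \), the point and the tangent vector are made up for illustration): the differential at a point eats a tangent vector and returns the infinitesimal change of \( f \) along it, which we can cross-check with a finite difference.

```python
import numpy as np

def f(p):
    """A scalar function on R^2, playing the role of f : M -> R."""
    x, y = p
    return x**2 + np.sin(y)

def df(p):
    """The differential of f at p: a row covector, i.e. a linear
    functional on the tangent space at p."""
    x, y = p
    return np.array([2 * x, np.cos(y)])  # partial derivatives of f

p = np.array([1.0, 0.5])
v = np.array([0.3, -0.7])  # a tangent vector at p

# df_p(v) is the infinitesimal change of f along v ...
pairing = df(p) @ v

# ... which matches a finite-difference approximation of Tf
eps = 1e-6
approx = (f(p + eps * v) - f(p)) / eps
print(pairing, approx)  # the two numbers nearly agree
```

Note that at no point did we need to identify the covector \( df_p \) with a tangent vector: it only ever acts on vectors.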

In 1-dimensional Euclidean differential geometry, aka calculus, this is witnessed by the definition \[ df = f' dx \] which students usually think of as a byproduct of \[ \dfrac {df}{dx} = f' \] at which teachers angrily react by screaming yOu cAn'T diViDe bY a diFFeRenTial!

In general, working in local coordinates one still gets \[ df = \dfrac {\partial f}{\partial x^i} dx^i \] which could make one think: well, what's wrong with defining \[ \partial f = \dfrac {\partial f}{\partial x^i} \partial x^i \] which, by the way, is obviously true: you can simplify \( \partial x^i \) and get an identity (inflicting mortal damage on the teacher left agonizing from the previous paragraph).

This works locally: indeed, the coordinate patch you choose allows you to pretend you're actually working in some patch of \( \mathbb R^n \), where it really works. To make it work globally, you need to make sure the identification \( dx^i \mapsto \partial x^i \) you choose glues along your coordinate patches, i.e. can be extended to a global bundle isomorphism \( TM \cong T^*M \).
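To see the obstruction concretely, compare the transformation laws: under a change of coordinates \( y = y(x) \), the two kinds of gadgets transform by mutually inverse Jacobians,

\[
dx^i = \frac{\partial x^i}{\partial y^j}\, dy^j ,
\qquad
\frac{\partial}{\partial x^i} = \frac{\partial y^j}{\partial x^i}\, \frac{\partial}{\partial y^j} ,
\]

so the pointwise assignment \( dx^i \mapsto \partial/\partial x^i \) commutes with the change of coordinates only when the Jacobian \( \partial y / \partial x \) is orthogonal (as it is for Euclidean isometries). A generic atlas has no reason to satisfy this, which is why extra structure is needed.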

This can be made more familiar if we realize that the data of an isomorphism \( \varphi : V \to V^* \) for a finite dimensional vector space is equivalently expressed as a non-degenerate bilinear form on \( V \) (i.e. a bilinear map \( \langle -, = \rangle : V \times V \to \mathbb R \) inducing isomorphisms \( V \to V^* \) in either slot). This is basically Riesz's theorem in finite dimension, and it's readily proven: given \( \varphi \), one can define \( \langle v, w \rangle = \varphi (v)(w) \), and given \( \langle -, = \rangle \) one can define \( \varphi (v) = \langle -, v \rangle \) *mumbles something about Yoneda*.
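The correspondence is easy to see numerically. A minimal sketch in NumPy, with a made-up symmetric matrix \( G \) encoding the bilinear form on \( V = \mathbb R^3 \): the induced map \( \varphi \) is just multiplication by \( G \), and non-degeneracy of the form is exactly invertibility of \( G \).

```python
import numpy as np

# A non-degenerate bilinear form on V = R^3, encoded by an invertible
# symmetric matrix G:  <v, w> = v^T G w.  (Example data, chosen arbitrarily.)
G = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])
assert np.linalg.det(G) != 0  # non-degeneracy

def pair(v, w):
    """The bilinear form <v, w>."""
    return v @ G @ w

def phi(v):
    """The induced isomorphism V -> V*: phi(v) = <-, v>, as a row covector."""
    return G @ v

v = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, 0.0, 4.0])

# phi(v) evaluated on w recovers the pairing:
assert np.isclose(phi(v) @ w, pair(w, v))

# and phi is invertible precisely because G is:
assert np.allclose(np.linalg.solve(G, phi(v)), v)
```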

Therefore the global correspondence we are after is equivalently a smooth choice of such a bilinear form on each tangent space. When these forms are symmetric (i.e. \( \langle v, w \rangle = \langle w, v\rangle \)), the data is called a Riemannian metric. When they're skew-symmetric (\( \langle v, w \rangle = -\langle w, v\rangle \)), it's a symplectic form. Using (one of) the isomorphisms induced by a bilinear form is called raising or lowering indices, especially in the context of Riemannian geometry, where these two operations are denoted by \( \sharp \) and \( \flat \), respectively.

Since Riemannian and symplectic structures are very common in the wild, the distracted pupil, full of hybris, might think that \( \partial f \) (which is then called the gradient of \( f \)) is as intrinsic as \( df \), and thus that vectors triumphed over covectors.

But they flew too close to the Sun and got burned, because gradients are not intrinsic objects! They depend on the specific choice of Riemannian or symplectic structure, and this choice can be wildly non-unique. Moreover, 'being a gradient' is not even invariant under a change of metric or symplectic structure!
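This is easy to witness numerically. A sketch in NumPy with made-up data: one and the same differential \( df \) at a point, raised through two perfectly legitimate (symmetric, positive-definite) metrics, yields two different gradients.

```python
import numpy as np

# The differential of some loss at a point, as a covector (row vector).
df = np.array([1.0, 0.0])

# Two Riemannian metrics on the same tangent space, both legitimate choices:
G1 = np.eye(2)                  # the Euclidean metric
G2 = np.array([[4.0, 0.0],
               [0.0, 1.0]])     # a rescaled metric

# Raising the index: grad f is defined by <grad f, w> = df(w),
# i.e. grad f = G^{-1} df.
grad1 = np.linalg.solve(G1, df)
grad2 = np.linalg.solve(G2, df)

print(grad1)  # [1. 0.]
print(grad2)  # [0.25 0.  ]  -- same covector, different 'downhill'
```

The covector \( df \) never changed; only the dictionary translating it into a vector did.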

In other words, derivatives are not vector fields, but differential forms!

Reverse derivative is pullback of forms

Machine learning these days is firmly grounded on backpropagation, an algorithm that computes 'the gradient' of a loss function in an efficient (compositional) way.

Let's be more explicit: suppose you have a data set \( (x,y) \), where \( x = x_1, \ldots , x_n \) are examples and \( y = y_1, \ldots , y_n \) are their labels. The goal of (most of) machine learning is to create models which can generalize this correspondence to new \( x' \) never seen before. The classic example is when \( x \) is a set of images, \( y \) is a set of labels such as 'cat', 'dog', 'bird', and so on, and I want to build a model which can recognize cats, dogs, birds and so on in images it has never seen before.

When I say 'model' I just mean a function from \( x \)s to \( y \)s. When I say 'build' I mean trying to come up with a nice form for this function (e.g. a neural network) and then training it as we train calculus students: show them a problem, ask them to solve it, and then evaluate their solutions by comparing them with the correct ones we know in advance.

So we have a function \( E(x,y') \) which, given the problem set and the tentative solution \( y' \), grades the student with a number, except that usually in machine learning \( 0 \) is the highest (suspiciously high, actually) mark. In fact, \( E \) is called the error or loss, and it evaluates 'how wrong' the proposed solution is.
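The post doesn't commit to a specific \( E \); as a concrete stand-in, here is one standard choice, mean squared error, sketched in NumPy:

```python
import numpy as np

def E(y_pred, y_true):
    """A common loss: mean squared error. 0 is a perfect mark."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean((y_pred - y_true) ** 2)

print(E([1.0, 2.0], [1.0, 2.0]))  # 0.0 -- full marks
print(E([1.0, 2.0], [0.0, 2.0]))  # 0.5 -- graded on how wrong it is
```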

As in teaching, model training works in rounds: after each assignment, we (hopefully) tell students what they did wrong so that next time they will (hopefully) perform better. Brains have figured out how to update themselves between evaluation rounds, but machine learning models haven't yet (emphasis on yet). So the trainer (usually in the form of an automated algorithm) looks at the loss and tunes the model accordingly.

This automated algorithm is 99% of the time a form of gradient descent, which works in the following way: to minimize the loss, just go down its gradient until you find a minimum. So, as in a calculus course, gradient descent works under the assumption that learning means minimizing the loss, also called training error.
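In its plainest form, 'go down the gradient until you find a minimum' is a one-line update rule. A minimal sketch on a made-up one-parameter loss:

```python
import numpy as np

def grad_descent(grad, theta0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step down the gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Toy loss E(theta) = (theta - 3)^2, minimized at theta = 3.
grad_E = lambda t: 2 * (t - 3.0)
theta_star = grad_descent(grad_E, theta0=[0.0])
print(theta_star)  # close to [3.]
```

Note that the update already smuggles in a metric: subtracting `lr * grad(theta)` from a point silently identifies covectors with vectors via the Euclidean structure, which is exactly the conflation discussed above.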

Now, the problem backpropagation solves is this: if the model is a very complicated function \( f \), computed in many possibly non-linear steps, how do we efficiently compute the gradient of \( E(x, f(x)) \)?

The idea is to 'propagate back' (hence the name) such a gradient by exploiting the chain rule to break down the gradient of the 'big' \( f \) (say, a neural network) into the many gradients of its 'small' constituents (say, layers or neurons of the network).

Long story short, this can be done by extracting, from a function \( f : \mathbb R^m \to \mathbb R^n \), its reverse derivative \( Rf : \mathbb R^m \times \mathbb R^n \to \mathbb R^m \), defined as \[ Rf(x, y) = y^\top J_f(x) . \] Here \( J_f \) is the Jacobian matrix of \( f \), which is the coordinate expression of its derivative.
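Here is what \( Rf \) looks like numerically, for a made-up \( f : \mathbb R^2 \to \mathbb R^3 \) with its Jacobian written out by hand (a sketch in NumPy, not any particular library's API):

```python
import numpy as np

def f(x):
    """A map f : R^2 -> R^3 (a stand-in for a layer of a network)."""
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def J_f(x):
    """Jacobian of f at x: a 3x2 matrix."""
    return np.array([[x[1],         x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0,          2 * x[1]]])

def Rf(x, y):
    """Reverse derivative Rf : R^2 x R^3 -> R^2, Rf(x, y) = y^T J_f(x)."""
    return y @ J_f(x)

x = np.array([1.0, 2.0])
y = np.array([1.0, 0.0, 0.0])  # a covector on the codomain

print(Rf(x, y))  # [2. 1.] -- the first row of J_f(x), carried back to R^2
```

Notice the asymmetry in the type: \( y \) lives over the codomain, the output over the domain. That arrow reversal is the whole point of the next section.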

Then the reverse derivative of a composite is the composite of the reverse derivatives (if we perform composition right, i.e. we consider the pair \((f, Rf)\) as a lens, more here).

Neat!

But what's a reverse derivative, for Levi-Civita's sake? The biggest hint to the truth is given by the way reverse derivatives are sometimes written in the 1-dimensional case: \[ Rg(x, dy) = g'(x) dy . \] Despite being nonsensical with the typing I gave you (which is, unfortunately, the usual type they're given), it points us in the right direction: first, let's rename \( \mathbb R^n \) and \( \mathbb R^m \) as \( N \) and \( M \), so we're not tempted to use their Euclidean pudding of structure. Then \( f \) now has type \( M \to N \).

Then we take the hint above and type \( Rf \) as \( M \times T^* N \to T^* M \), so that \( dy \) now really makes sense. It's also important to point out that \( M \times T^*N \) should really be the pullback of \( T^* N \) along \( f \) (which, incidentally, is a Cartesian product if done in the right category), since we want our assignment to respect the structure of the cotangent bundle.

Now, what's the pullback of differential forms? A smooth map of manifolds induces a map between their tangent bundles, its differential. The fiberwise dual of this map is the pullback of forms. Concretely, given \( f:M \to N \), the pullback of forms is the map of bundles \( f^* : T^*N \to T^*M \) given on a differential form \( \alpha \) on \( N \) by \[ f^* \alpha = \alpha \circ df . \] To see this is indeed the reverse derivative, let's go back to the Euclidean case: \( df \) is the Jacobian of \( f \), \( J_f \), and if we represent \( \alpha \) in matrix form, it is given at each point by a row vector, because dualizing in the Euclidean case corresponds to transposition. Thus if \( y \) represents \( \alpha \), \( y^\top \) is its matrix form.
Finally, since composition of linear maps corresponds to multiplication of their associated matrices, we recover the initial expression.

Epilogue

Although in most cases Euclidean spaces are all that's needed, it makes sense to have a differential-geometric expression of the reverse derivative for two reasons: first, it may help clarify what's going on, and second, it immediately generalizes to learning on manifolds, which isn't exactly useless.

An interesting corollary of this whole story is that, in general, reverse derivatives do not organize into plain lenses, but into generalized lenses in the sense of Spivak. It also helps to see the difference between the horizontal and vertical categories of Myers' differential doctrines, if you know what I'm talking about: the horizontal one is about mapping systems into other systems, so \( T \) is used there, while in the vertical category of systems \( T^* \) is to be used.

To finish my rant, let me also explain why it makes sense, conceptually, to backpropagate a covector and not a vector.

First of all, since we've seen in the previous section that derivatives are covectors, which can be turned into vectors (the gradient) by raising indices with a metric or a symplectic form, it should be obvious that the problem of propagating the derivative is actually the problem of propagating a covector.

Secondly, what is, intuitively, the differential of a function? It's an infinitesimal representation of the function: at a point \( x \), \( df \) measures the infinitesimal change of \( f \) along a given direction vector \( v \in T_x M \). Apply this to a loss: it gives us an infinitesimal representation of the loss around the current choice of parameters of our model (say, the weights of a neural network), enabling us to 'judge' each proposed change of parameters (a tangent vector). We then use gradients because we want to actually move from our current parameters to new ones, and motions are given by vectors. A Riemannian metric, then, allows us to implement such a change by moving along a geodesic emanating from the parameter we are at now (this is gradient descent on Riemannian manifolds). Choose a different metric and the same loss function will select a different direction of descent, because we changed what 'down' means (think: geodesic motion in general relativity).

Further work

Interestingly, this whole business of losses as differential forms is what Hamiltonian mechanics is based on. That is, while the dynamics of a system is encoded by vector fields, the energy landscape of the system is encoded in a function \( h \) which plays the same role as a loss function. Dynamics is extracted by means of a symplectic form, which turns the changes entailed by \( h \) into actual motions. This is explained in Section 3.2 of my notes about the symplectic formulation of Hamiltonian mechanics.

The similarity is so striking that one could bet that Euler-Lagrange equations may actually be used to express the solution of machine learning problems.