# What is Automatic Differentiation?

**Automatic Differentiation** (**A.D.**) is a technique to evaluate the **derivatives** of a function defined by a computer **program**.

- **How can one obtain derivatives?**
- **Automatic Differentiation**
- **Computer Programs and Mathematical Functions**
- **Activity of variables**
- **Focus on the directional derivatives (tangents)**
- **Focus on the gradients**
- **Focus on multi-directional mode**

## How can one obtain derivatives?

There are many ways to obtain derivatives. Maybe the most straightforward is to go back to the **equations** that led to the program. Suppose that some result is defined by some equations, and the program solves these equations for this result. Then one can write a new set of equations, whose solutions are the derivatives of the initial result. Consequently, one can write a new program that solves these new equations for the desired derivatives. This is mathematically very sound, but it is probably the most **expensive** way, since it implies discretizing new equations, then writing a new program. We all know how difficult this is, and how many errors can be made! Moreover, in some cases there are no simple original equations, and only the program is at hand. Therefore we are looking for another way, more economical, that uses only the original program.

Using only the original program, one can naturally compute **divided differences**. For a given set of the program's inputs **X**, the program **P** has computed a result **P(X) = Y**. In the general case, both **X** and **Y** are **vectors**, i.e. are composed of several real numbers. Given now some normalized direction **dX** in the space of the inputs, one can run the program **P** again, on the new set of inputs **X + ε.dX**, where ε is some very small positive number. Then a good **approximation** of the derivative is computed easily by:

    (P(X + ε.dX) - P(X)) / ε

The **centered divided differences**, computed by:

    (P(X + ε.dX) - P(X - ε.dX)) / 2ε

usually give a better approximation, but cost an additional execution of program **P**. Of course this is just an approximation of the derivatives. Ideally, the exact derivative is the limit of these formulas when ε tends to zero. But this makes no sense on a real computer, since very small values of ε lead to truncation errors, and therefore to erroneous derivatives. This is the main drawback of divided differences: some tradeoff must be found between truncation errors and approximation errors. Finding the best ε requires numerous executions of the program, and even then the computed derivatives are just approximations.
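The two formulas can be sketched in Python on a tiny, hypothetical stand-in for the program `P` (the exact directional derivative is known here, so the quality of each approximation can be checked):

```python
def P(x):
    # A tiny "program": a smooth function from R^2 to R
    return x[0] * x[1] + x[0] ** 2

def forward_difference(P, X, dX, eps):
    # (P(X + eps*dX) - P(X)) / eps : one extra run of P
    Xp = [a + eps * b for a, b in zip(X, dX)]
    return (P(Xp) - P(X)) / eps

def centered_difference(P, X, dX, eps):
    # (P(X + eps*dX) - P(X - eps*dX)) / (2*eps) : two extra runs of P
    Xp = [a + eps * b for a, b in zip(X, dX)]
    Xm = [a - eps * b for a, b in zip(X, dX)]
    return (P(Xp) - P(Xm)) / (2 * eps)

X, dX = [2.0, 3.0], [1.0, 0.0]
# Exact directional derivative: dP/dx0 = x1 + 2*x0 = 7
print(forward_difference(P, X, dX, 1e-6))   # close to 7
print(centered_difference(P, X, dX, 1e-6))  # even closer to 7
```

Shrinking `eps` further eventually makes both results *worse*, which is exactly the truncation-versus-approximation tradeoff described above.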

## Automatic Differentiation :

**Automatic Differentiation**, just like divided differences, requires only the original program **P**. But instead of executing **P** on different sets of inputs, it builds a new, augmented program **P'**, that computes the **analytical** derivatives along with the original program. This new program is called the **differentiated** program. Precisely, each time the original program holds some value **v**, the differentiated program holds an additional value **dv**, the **differential** of **v**. Moreover, each time the original program performs some operation, the differentiated program performs additional operations dealing with the differential values. For example, if the original program, at some time during execution, executes the following instruction on variables `a`, `b`, `c`, and array `T`:

`a = b*T(10) + c`

then the differentiated program must execute additional operations on these variables and their **differentials** `da`, `db`, `dc`, and array `dT`, that must somehow amount to:

`da = db*T(10) + b*dT(10) + dc`

The derivatives are now computed analytically, using the well known formulas on derivation of elementary operations. Approximation errors have just vanished.
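In Python terms, this pair of instructions can be sketched as follows (the values are made up for the illustration; note that the differentiated statement is executed before the original one, so that it reads the values before they can be overwritten):

```python
# Sample values for the original variables (arbitrary)
b, c = 2.0, 1.0
T = [0.0] * 11
T[10] = 5.0

# Their differentials, seeded to differentiate with respect to b
db, dc = 1.0, 0.0
dT = [0.0] * 11        # T does not depend on b

# Differentiated instruction first, then the original instruction
da = db * T[10] + b * dT[10] + dc
a = b * T[10] + c

print(a, da)           # a = 11.0 and da = d(b*T[10]+c)/db = 5.0
```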

Again, there are two ways to implement A.D.:

- **Overloading** consists in telling the compiler that each real number is replaced by a pair of real numbers, the second holding the differential. Each elementary operation on real numbers is overloaded, i.e. internally replaced by a new one, working on pairs of reals, that computes the value **and** its differential. The advantage is that the original program is virtually unchanged, since everything is done at compile time. The drawback is that the resulting program runs slowly, because it constantly builds and destroys pairs of real numbers. Moreover, it is very hard to implement the "**reverse mode**" with overloading.
- **Source transformation** consists in adding into the program the new variables, arrays, and data structures that will hold the derivatives, and in adding the new instructions that compute these derivatives. The advantage is that the resulting program can be compiled into efficient code, and the "**reverse mode**" is possible. The drawback is that this is an enormous transformation, which cannot be done by hand on large programs. Tools are needed to perform this transformation correctly and rapidly. Our team studies this sort of tool. Our Tapenade engine is one such Automatic Differentiation tool that uses source transformation.
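The overloading approach can be sketched in Python with a minimal "dual number" class (a simplified illustration, not how tools like ADOL-C are actually implemented):

```python
class Dual:
    """A real value paired with its differential; arithmetic is overloaded."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

# The "program" itself is unchanged; only the type of its inputs changes
def f(x, y):
    return 3 * x * y + y

x = Dual(2.0, 1.0)   # seed: differentiate with respect to x
y = Dual(4.0, 0.0)
r = f(x, y)
print(r.val, r.dot)  # value 28.0 and df/dx = 3*y = 12.0
```

The constant building and destroying of `Dual` objects in every operation is precisely the run-time overhead mentioned above.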

Until now, we have remained vague about these **differentials** and **derivatives**. This is because there are many "derivative objects" one may want, and each of them is obtained through a specific Automatic Differentiation. Here are some examples:

- One may really want the **derivative** of the program. More exactly, since the program's inputs and outputs are generally **vectors**, the derivative is a matrix, known as the **Jacobian** matrix. Each element **(i,j)** of the Jacobian matrix is the partial derivative of output **Y**_{i} with respect to input **X**_{j}. For that, the derivative object that must be computed along with each value **v** is the **vector** of all partial derivatives of **v** with respect to each input. When values are themselves arrays, the derivative object becomes a matrix. This leads to carrying partial Jacobian matrices along with the main computation stream. This is usually expensive in time and memory space. This drawback can be partially coped with, using sparse representations for the derivative objects.
- One may simply want **directional derivatives**, also known as **tangents**. This means that, instead of the explicit **Jacobian** matrix **J**, one just needs its effect on a given direction vector **dX** in the input space. This effect is actually **dY = J * dX**, but one can evaluate it without explicitly computing **J**. For that, the derivative object that must be computed along with each value **v** is its "**differential**" **dv**, i.e. (the first-order approximation of) the quantity by which **v** varies when the input **X** varies by **dX**. As expected, this is far less expensive than building the whole Jacobian. Note that directional derivatives can also provide us with the **Jacobian** matrix, if directional derivatives are computed for each canonical direction in the input space. In other words, this requires one computation of the directional derivative per input variable, and even less than that if one knows that the Jacobian matrix is sufficiently sparse.
- One may want **gradients**. When there is only one output, the Jacobian matrix has only one line, which is exactly the gradient of the result with respect to the inputs. When there are many outputs, one defines a scalar composite result, i.e. some linear combination of these outputs, so as to compute the gradient of this result. To put it differently, one wants a combination **Xb**^{*} = **Yb**^{*} * **J** of the lines of the Jacobian. The "*" superscript indicates transposition, i.e. **Xb**^{*} and **Yb**^{*} are row vectors. This can be computed without explicitly computing **J**. For that, the derivative object is called an "**adjoint**"; it is computed in the reverse of the original program's order, as will be shown later. Like with the directional derivatives, one can compute the whole Jacobian matrix by repeatedly computing gradients, for each canonical direction in the output space, or fewer than that when the Jacobian is sparse. It is easy to check that this approach is advisable when there are far more inputs than outputs.
- One may want **higher-order derivatives** (Hessian). This can be done computing a derivative object which is a partial Hessian tensor. Since this might be very expensive, another approach is to compute **directional** higher-order derivatives, and use them afterwards to recompute the coefficients of the Hessian (cf [GriewankUtkeWalther97]).
- One may want **directional higher-order derivatives**. Given a direction vector *dX* in the input space, one computes along the original program a derivative object that contains the first and higher derivatives of each intermediate value in the direction *dX*. Higher-order derivatives are useful to get a more accurate approximation, or to optimize more efficiently.
- One may want **Taylor series** expansions (cf [Griewank00]).
- One may want to compute **intervals**. This is not strictly speaking a matter of derivatives, but shares many similarities. The derivative object computed along with each value **v** is the interval into which this value may range.
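The remark that directional derivatives can rebuild the full Jacobian, one canonical input direction at a time, can be sketched as follows (a hypothetical two-by-two example, with a hand-written tangent function):

```python
def tangent_F(X, dX):
    # Hand-written tangent of F(x0, x1) = (x0*x1, x0 + x1**2):
    # returns both the value Y and the directional derivative dY = J*dX
    x0, x1 = X
    dx0, dx1 = dX
    Y = (x0 * x1, x0 + x1 ** 2)
    dY = (dx0 * x1 + x0 * dx1, dx0 + 2 * x1 * dx1)
    return Y, dY

# One tangent run per canonical direction gives one Jacobian column
X = (2.0, 3.0)
cols = [tangent_F(X, e)[1] for e in ((1.0, 0.0), (0.0, 1.0))]
J = [[cols[j][i] for j in range(2)] for i in range(2)]
print(J)   # [[x1, x0], [1, 2*x1]] at X, i.e. [[3.0, 2.0], [1.0, 6.0]]
```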

Our work focuses on **directional derivatives** and **gradients**. We implemented these two modes of Automatic Differentiation in the Tapenade tool, the successor of **Odyssée**. Some years ago, the former **SAFIR** team of INRIA, who developed Odyssée, also implemented the **Jacobian** mode to validate their research. This mode has not been maintained in Odyssée, and therefore is not presently available in Tapenade. The **TAMC** (now **TAF**) tool also computes **directional derivatives** and **gradients**. The **ADIFOR** tool computes **directional derivatives**, **gradients**, and **Jacobians**, taking advantage of sparsity. All the above tools are based on **source transformation**. In contrast, **ADOL-C**, based on **overloading**, has been used for **directional derivatives** and **gradients**, but these are relatively slow, and also for **directional higher-order derivatives** and **Taylor series**. Other very close tools exist, such as **PADRE-2**, mainly used for **error estimation**.

## Computer Programs and Mathematical Functions :

Before we can focus on **directional derivatives** and **gradients**, we need to introduce a general framework in which we can describe these two modes of A.D. precisely. Essentially, we identify **programs** with **sequences of instructions**, themselves identified with **composed functions**.

Programs contain **control**, which is essentially discrete and therefore non-differentiable. We identify a program with the set of all possible run-time sequences of instructions. Of course there are **many** such sequences, and therefore we never build them explicitly. The **control** is just how one tells the running program to switch to one sequence or another. For example, this small program piece:

    if (x <= 1.0) then
        printf("x too small\n");
    else {
        y = 1.0;
        while (y <= 10.0) {
            y = y*x;
            x = x+0.5;
        }
    }

stands for the set of possible run-time sequences of instructions:

    printf("x too small\n");

    y = 1.0;

    y = 1.0; y = y*x; x = x+0.5;

    y = 1.0; y = y*x; x = x+0.5; y = y*x; x = x+0.5;

    y = 1.0; y = y*x; x = x+0.5; y = y*x; x = x+0.5; y = y*x; x = x+0.5;

*and so on...*

Each of these sequences is "differentiable". The new program generated by Automatic Differentiation of course contains some control too, such that each of its run-time sequences of instructions computes the derivative of the corresponding sequence of instructions in the initial program. Thus, this differentiated program will probably look like this (no matter whether it is obtained by overloading or by source transformation):

    if (x <= 1.0) then
        printf("x too small\n");
    else {
        y = 1.0; dy = 0.0;
        while (y <= 10.0) {
            dy = dy*x + y*dx;
            y = y*x;
            x = x+0.5;
        }
    }

However, a problem remains: a small change in the value of `x` may change the number of iterations of the `while` loop, and at such points the derivative is not defined any more. Yet the new program generated by Automatic Differentiation will still return a result, and using this derivative may lead to errors. In other words, the original program, with control, is only **piecewise differentiable**, and most present A.D. tools do not take this into account correctly. This is an open research problem. In the meantime, we simply assume that this problem happens rarely enough. Experience on real programs shows that this is a reasonable assumption.
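The discontinuity can be observed on a Python transcription of the loop above (illustrative values; the boundary happens to lie between the two inputs chosen here):

```python
def loop(x):
    # Same control structure as the example above
    y = 1.0
    while y <= 10.0:
        y = y * x
        x = x + 0.5
    return y

# Two nearby inputs take a different number of loop iterations,
# so the result jumps and the derivative is undefined in between:
print(loop(2.92))   # about 39.1 (three iterations)
print(loop(2.93))   # about 10.0 (two iterations)
```

A divided difference straddling such a point, or a derivative computed by AD on either side of it, says nothing about the jump itself.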

Now that programs are identified with sequences of instructions, these sequences are identified with **composed functions**. Precisely, the sequence of instructions

*I*_{1}; *I*_{2}; ... *I*_{p-1}; *I*_{p};

is identified to the function

*f* = *f*_{p} o *f*_{p-1} o ... o *f*_{1}

where each *f*_{n} is the function implemented by instruction *I*_{n}. Of course, each of these functions is extended to operate on the domain of all the program variables; variables not overwritten by the instruction are just transmitted unchanged to the function's result. We can then use the chain rule to formally write the derivative of the program for a given vector input *x*:

*f'*(*x*) = *f'*_{p}(*x*_{p-1}) . *f'*_{p-1}(*x*_{p-2}) . ... . *f'*_{1}(*x*_{0})

We wrote *f'*_{n} for the derivative of function *f*_{n}. The *f'*_{n} are thus Jacobian matrices. To write the above in a shorter form, we introduced *x*_{n}, which is the vector of values of the variables just after executing the first *n* functions (we set *x*_{0} = *x*).
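As a sanity check, this identification can be sketched numerically on a hypothetical two-instruction program:

```python
# Program: I1: a = x*x ; I2: y = 3*a
# So f1(x) = x^2, f2(a) = 3a, and f = f2 o f1, i.e. f(x) = 3*x^2
def f1(x): return x * x
def f2(a): return 3 * a

def f1p(x): return 2 * x     # elementary (1x1) Jacobian of f1
def f2p(a): return 3.0       # elementary (1x1) Jacobian of f2

x0 = 5.0
x1 = f1(x0)                  # x_1: state after the first instruction
deriv = f2p(x1) * f1p(x0)    # chain rule: f'(x0) = f2'(x1) . f1'(x0)
print(deriv)                 # 30.0 = d(3*x^2)/dx at x = 5
```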

Therefore, computing the derivatives is just computing and multiplying these elementary Jacobian matrices *f'*_{n}(*x*_{n-1}).

## Activity of variables :

In practice, not all the elements of the Jacobian matrix need to be computed. Some elements can be proved to be always null, some others are simply not needed by the end user. We can take this into account to simplify the differentiated program. This is summarized by the notion of **activity** of an instance of a variable. The end user specifies a set of output variables, for which the derivatives are requested. These are called the **dependent** variables. The end user also specifies a set of input variables, with respect to which the **dependent** output must be differentiated. These are called the **independent** variables. We shall say that an instance of a variable is **active** if

- it depends on some independent, in a differentiable way, and
- some dependent depends on it, in a differentiable way.

**Activity** can be found by a static analysis. The set of the **varying** variables, those which depend on some **independent**, is propagated forwards through the program. At the beginning of the program, the **varying** variables are just initialized to the **independent**. Similarly, the set of the **useful** variables, those whose value influences some **dependent**, is propagated backwards through the program. At the end of the program, the **useful** variables are just initialized to the **dependent**. Finally, at each point in the program, a variable is **active** if it is both **varying** and **useful**. Only the derivatives of **active** variables need to be computed.
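The two dataflow propagations can be sketched on a straight-line program (a hypothetical mini-analysis, with each instruction abstracted as a pair "variable written, variables read"; real tools propagate these sets over flow graphs):

```python
# Each instruction: (variable written, set of variables read)
program = [
    ("t", {"x"}),       # t = f(x)
    ("u", {"c"}),       # u = g(c)    -- c is a constant input
    ("y", {"t", "u"}),  # y = h(t, u)
]
independents, dependents = {"x"}, {"y"}

# Forward sweep: which variables vary with some independent?
varying = set(independents)
varying_after = []
for out, ins in program:
    if ins & varying:
        varying.add(out)
    else:
        varying.discard(out)   # overwritten by a non-varying value
    varying_after.append(set(varying))

# Backward sweep: which variables influence some dependent?
useful = set(dependents)
useful_before = [None] * len(program)
for i in range(len(program) - 1, -1, -1):
    out, ins = program[i]
    if out in useful:
        useful.discard(out)
        useful |= ins
    useful_before[i] = set(useful)

# Active between two instructions: both varying and useful there
active_mid = varying_after[0] & useful_before[1]
print(active_mid)   # only t is active there: u never varies, x is no longer useful
```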

## Focus on the directional derivatives (tangents):

Computing the product of all the elementary Jacobian matrices returns the complete Jacobian matrix **J** of *f*. This is certainly expensive in time and space. Often, one just wants a directional derivative **dY = J * dX**, where **J = f'(x)**. If we replace **J** by its above expansion with the chain rule, we obtain:

*dY* = *f'*_{p}(*x*_{p-1}) . *f'*_{p-1}(*x*_{p-2}) . ... . *f'*_{1}(*x*_{0}) . *dX*

which must be computed from right to left, because "matrix times vector" products are so much cheaper than "matrix times matrix" products. Things are easy here, because *x*_{0} is required first, then *x*_{1}, and so on. Therefore the differentiated instructions, which compute the Jacobian matrices and multiply them, can be executed along with the initial program. We only need to interleave the original instructions and the derivative instructions. This is called the **tangent mode** of Automatic Differentiation.

In the tangent mode, the differentiated program is just a copy of the given program. Additional derivative instructions are inserted just before each instruction that involves **active** variables. Each **active** variable gives birth to a new variable, of the same type and size, which is called its derivative variable. The control structures of the program are unchanged, i.e. the Call Graph and the Flow Graph keep the same shape.

Here is an example. Consider the original (piece of) program:

    ...
    v2 = 2*sin(v1) + 5
    if (v2>0.0) then
        v4 = v2 + p1*v3/v2
        p1 = p1 + 0.5
    endif
    call sub1(v2,v4,p1)
    ...

Suppose that `v1, v2, v3` and `v4` are **active**, as defined above. Then AD introduces new differentiated variables, that we shall write `dv1, dv2, dv3` and `dv4`. Differentiated variables retain the type and dimension of the original variable. The tangent mode AD will generate the corresponding piece of program, with differentiated instructions inserted just before their original instruction:

    ...
    dv2 = 2*cos(v1)*dv1
    v2 = 2*sin(v1) + 5
    if (v2>0.0) then
        dv4 = dv2*(1-p1*v3/(v2*v2)) + dv3*p1/v2
        v4 = v2 + p1*v3/v2
        p1 = p1 + 0.5
    endif
    call sub1_d(v2,dv2,v4,dv4,p1)
    ...

Notice that the control structures are preserved (the `if-then-endif`), and instructions dealing with non-active variables (such as `p1 = p1 + 0.5`) are kept and not differentiated.
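As a cross-check, the same tangent rules can be mimicked in Python (a hypothetical transliteration of the example, with the `sub1` call omitted and an `else` branch added so the sketch always returns a value), and compared against a divided difference:

```python
import math

def piece(v1, v3, p1):
    # Original piece of program (sub1 call omitted for the sketch)
    v2 = 2 * math.sin(v1) + 5
    if v2 > 0.0:
        v4 = v2 + p1 * v3 / v2
        p1 = p1 + 0.5
    else:
        v4 = 0.0
    return v4

def piece_d(v1, dv1, v3, dv3, p1):
    # Tangent code: each derivative statement precedes its original
    dv2 = 2 * math.cos(v1) * dv1
    v2 = 2 * math.sin(v1) + 5
    if v2 > 0.0:
        dv4 = dv2 * (1 - p1 * v3 / (v2 * v2)) + dv3 * p1 / v2
        v4 = v2 + p1 * v3 / v2
        p1 = p1 + 0.5
    else:
        dv4 = 0.0
        v4 = 0.0
    return v4, dv4

# Compare with a divided difference in the direction (dv1, dv3) = (1, 0)
v1, v3, p1, eps = 0.3, 2.0, 1.5, 1e-7
_, dv4 = piece_d(v1, 1.0, v3, 0.0, p1)
approx = (piece(v1 + eps, v3, p1) - piece(v1, v3, p1)) / eps
print(dv4, approx)   # the two values agree to several digits
```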

Intrinsic functions, such as `sin`, are differentiated directly, "in line", using an external configuration file which specifies, for example, that the (partial) derivative of `sin(x)` with respect to `x` is `cos(x)`.

The structure of subroutine calls is also preserved, with a slight difference: since a routine call is in fact a shorthand for another piece of program, and since this piece of program will be augmented with differentiated instructions too, the call to the differentiated routine `sub1_d` simply *replaces* the call to `sub1`. In other words, `sub1_d` does the previous work of `sub1` *plus* the derivative computations. Of course, these derivative instructions use the derivative variables, and therefore each parameter which is active at the beginning or at the end of the routine requires an extra parameter holding its derivative.

Consider now the special case of a call to a **black-box** routine. This is basically the same as the FORTRAN `EXTERNAL` routines, but we generalize it as follows: we call **black-box** a routine on which we cannot, or must not, or do not want to run AD! This can be because we don't have the source for it (`EXTERNAL` routines), or because we want to provide a much better hand-written derivative routine. In this case, AD of the calling program needs additional information about the black-box routine. Suppose that `sub1` above is black-box. If AD knows nothing about it, then it must make the conservative assumption that `p1`, although not active at call time, may become active at return time. For example, its exit value might depend on the active input `v2`. Therefore AD must introduce a derivative `dp1`, initialized to `0.0` before the call, and containing some nontrivial result upon return. The last lines of the differentiated code should be:

    ...
    dp1 = 0.0
    call sub1_d(v2,dv2,v4,dv4,p1,dp1)
    ...

This useless `dp1` can be avoided by providing the AD tool with a **dependence matrix** for **black-box** routines, stating which outputs depend on which inputs. For example, the following matrix indicates that the exit `p1` does not depend on any active entry variable.

This matrix can capture precise behavior. For example, it also implies here that although the active entry `v2` is used to compute the active exit `v4`, the exit `v2` is not active any more, because it only depends on the input `p1`, which is not active. When AD is provided with this dependence matrix for the black-box `sub1` (e.g. in an external configuration file), it generates a program that calls

    call sub1_d(v2,dv2,v4,dv4,p1)

Provided the user supplies the corresponding `sub1_d`, with the correct arguments at the correct places, everything will run fine!


This is all we will say here about the general principles of AD in direct mode. To go further, you may read the Tapenade reference article and the Tapenade 2.1 user's guide, which describe conventions about variable names and declarations, differentiation of individual instructions and control structures, and in general all you must know to understand and use differentiated programs produced by Tapenade.

## Focus on the gradients :

Whereas the tangent mode of AD is maybe the most immediate, it turns out that many applications actually need **gradients**. The reverse mode of AD does just that! In fact, the reverse mode computes gradients in an extremely efficient way, at least theoretically. This is why it is one of the most promising modes of AD.

Strictly speaking, the gradient of a function *f* is defined only when *f* returns a scalar. In other words, there can be only one optimization criterion. If *f* returns a vector **Y** of results, we must define from **Y** a scalar composite result, i.e. a linear combination of its elements. We write this scalar optimization criterion as a dot product **Yb**^{*} * **Y**. The gradient of this scalar optimization criterion is therefore, after transposition:

**Xb**^{*} = **Yb**^{*} . *f'*(*x*) = **Yb**^{*} . *f'*_{p}(*x*_{p-1}) . *f'*_{p-1}(*x*_{p-2}) . ... . *f'*_{1}(*x*_{0})

Again, this is most efficiently computed from right to left, because matrix-times-vector products are so much cheaper than matrix-times-matrix products. This is the principle of the reverse mode.

This turns out to make a very efficient program, at least theoretically (cf. Section 3.4 of [Griewank00]). The computation time required for the gradient is only a small multiple of the run time of the original program, and is independent of the number of inputs in **X**. Moreover, if there are only a small number *m* of scalar outputs in **Y**, one can compute the whole Jacobian matrix row by row very efficiently.

However, we observe that the *x*_{n} are required in the **inverse** of their computation order. If the original program **overwrites** a part of *x*_{n}, the differentiated program must restore *x*_{n} before it is used by *f'*_{n+1}(*x*_{n}).

**This is the biggest problem of the reverse mode!**

Two strategies exist to solve it:

- **Recompute All (RA):** the *x*_{n} is recomputed when needed, by restarting the program on input *x*_{0} until instruction *I*_{n}. The TAF tool uses this strategy. The following figure illustrates the **RA** strategy: one can see that before the adjoint of each instruction, the necessary part of the original program is recomputed. Notice that this requires that we save the environment before the first run (big black dot), and restore it (big white dots) before each duplicate run. We see that the brute-force **RA** strategy has a quadratic time cost with respect to the total number of run-time instructions *p*.
- **Store All (SA):** the *x*_{n} are restored from a stack when needed. This stack is filled during a preliminary run of the program, the **"forward sweep"**, which additionally stores variables on the stack just before they are overwritten. The tools Adifor and Tapenade use this strategy. The following figure illustrates the **SA** strategy: one can see that before each instruction of the forward sweep, some values are saved (small black dots), and they are restored (small white dots) just before their corresponding derivative instruction during the so-called **backward sweep**. The brute-force **SA** strategy has a linear memory cost with respect to *p*.

Both the **RA** and **SA** strategies need a special storage/recomputation trade-off in order to be really profitable, and this trade-off makes them become very similar. This trade-off is called **checkpointing**. Checkpointing consists in selecting some part of the program, and in **not** storing the values overwritten by this part (in **SA** mode) nor repeating execution of this part (in **RA** mode), at the cost of some extra memorization of the complete state of the program, which we call a **snapshot**.

- In **Recompute All (RA)** mode, the result of checkpointing part **P** is shown at the top of the next figure. The **snapshot** (big black dot) memorizes the state of the program just after **P**. The bottom part of the figure shows the execution of the program when there are many nested checkpoints. There are fewer duplicate executions, but it costs some extra snapshots.
- In **Store All (SA)** mode, the result of checkpointing part **P** is shown at the top of the next figure. The **snapshot** (big black dot) memorizes the state of the program just before **P**. There is an additional cost, which is the duplicate execution of **P**. The bottom part of the figure shows the execution of the program when there are many nested checkpoints. There are fewer values memorized, but it costs some snapshots and some duplicate executions.

In Tapenade, the strategy is **Store All (SA)**, and the nested checkpointed parts are the subroutine calls. In the model case where the program is a sequence of **N** parts, all of them with roughly the same cost, it was shown that the costs of the reverse differentiated program, in terms of memory size and in terms of number of duplicated executions, grow only like the logarithm of **N**, which is very good.

**In the sequel, we shall concentrate on the Store All (SA) strategy.**

The structure of a reverse differentiated program is therefore more complex than in the tangent mode. There is first a **forward sweep**, identical to the original program, augmented with instructions that memorize the intermediate values before they are overwritten, and with instructions that memorize the control flow, in order to reproduce it in reverse during the **backward sweep**. Then comes the **backward sweep**, which actually computes the derivatives, and uses the memorized values both to compute the derivatives of each instruction and to choose its flow of control. Here is the same example as for the tangent mode. Consider again this original (piece of) program:

    ...
    v2 = 2*sin(v1) + 5
    if (v2>0.0) then
        v4 = v2 + p1*v3/v2
        p1 = p1 + 0.5
    endif
    call sub1(v2,v4,p1)
    ...

Suppose again that `v1, v2, v3` and `v4` are **active**, as defined above. Then AD introduces new differentiated variables, that we shall write `v1b, v2b, v3b` and `v4b`. Here also, differentiated variables retain the type and dimension of the original variable. Suppose that memorizations are done with a stack, used through procedures `push` and `pop`. The **forward sweep** will look like:

    ...
    push(v2)
    v2 = 2*sin(v1) + 5
    if (v2>0.0) then
        push(v4)
        v4 = v2 + p1*v3/v2
        push(p1)
        p1 = p1 + 0.5
        push(true)
    else
        push(false)
    endif
    push(v2)
    push(v4)
    push(p1)
    call sub1(v2,v4,p1)
    ...

The calls to `push` are memorizations of intermediate values and of the control. The group of three `push` before the subroutine call is a **snapshot**, because the call will be **checkpointed**. The corresponding **backward sweep** will then be:

    ...
    pop(p1)
    pop(v4)
    pop(v2)
    call sub1_b(v2,v2b,v4,v4b,p1)
    pop(control)
    if (control) then
        pop(p1)
        pop(v4)
        v3b = v3b + v4b*p1/v2
        v2b = v2b + v4b*(1-p1*v3/(v2*v2))
        v4b = 0.0
    endif
    pop(v2)
    v1b = v1b + v2b*2*cos(v1)
    v2b = 0.0
    ...

Notice that `sub1_b` is the result of checkpointing applied to `sub1`, and therefore itself consists of a **forward sweep** followed by a **backward sweep**. On the other hand, intrinsic functions are better differentiated in line, as here where there is no call to a `sin_b` procedure, but rather an explicit use of the partial derivative `cos(v1)`.
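The forward/backward sweep pair can be mimicked in Python (a hypothetical sketch of the same example; the checkpointed `sub1` call and its snapshot are omitted, and `0.0` stands in for the old, uninitialized values of `v2` and `v4`):

```python
import math

def piece_b(v1, v3, p1, v4b):
    """Reverse differentiation of the example piece (sub1 omitted).

    Given the adjoint v4b of the output v4, returns the adjoints
    v1b, v3b of the inputs v1, v3.
    """
    stack = []

    # Forward sweep: original instructions, each preceded by a push
    # of the value it overwrites, plus a push memorizing the control
    stack.append(0.0)                 # old v2
    v2 = 2 * math.sin(v1) + 5
    if v2 > 0.0:
        stack.append(0.0)             # old v4
        v4 = v2 + p1 * v3 / v2
        stack.append(p1)              # old p1
        p1 = p1 + 0.5
        stack.append(True)
    else:
        stack.append(False)

    # Backward sweep: restore values and accumulate adjoints in reverse
    v1b = v2b = v3b = 0.0
    control = stack.pop()
    if control:
        p1 = stack.pop()              # restore p1 before the += 0.5
        v4 = stack.pop()
        v3b = v3b + v4b * p1 / v2
        v2b = v2b + v4b * (1 - p1 * v3 / (v2 * v2))
        v4b = 0.0
    v2 = stack.pop()
    v1b = v1b + v2b * 2 * math.cos(v1)
    v2b = 0.0
    return v1b, v3b

# One reverse run yields the whole gradient of v4 w.r.t. (v1, v3)
v1b, v3b = piece_b(0.3, 2.0, 1.5, 1.0)
print(v1b, v3b)
```

Note that a single backward sweep produced both partial derivatives, whereas the tangent mode would have needed one run per input direction.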

The additional parameters passed to the call to `sub1_b` strongly depend on the results of the **activity analysis**. If an argument is active neither at the entry into the procedure nor at the exit from it, then we need not pass a derivative argument. Otherwise we must pass one, and it must be initialized when necessary.

When the procedure is a function, its derivative procedure may have an extra derivative parameter, which is the differential of the function's output. On the other hand, the function's output itself can very well be dropped: remember that the calls to the derivative function are enclosed in a **backward sweep**, so there is no one left there to use this function's result!

This is all we will say here about the general principles of AD in reverse mode. To go further, you may read the Tapenade reference article and the Tapenade 2.1 user's guide, that will describe conventions about variable names and declarations, differentiation of individual instructions, control structures, procedure calls, and in general all you must know to understand and use differentiated programs produced by Tapenade.

## Focus on multi-directional mode :

...