A collection of functions \(\{ f_i \}_i\) forming a nested composition defines a composed function \(f\) as well as a functional \(F[\{ f_i \}_i]\). The composition structure is a directed acyclic graph (DAG).
          f0
        /  |  \
      f1   ...   fk
     /  \       /  \
  fk+1  ...............
Let \(x\) denote an input to the composed function \(f\), and let Greek symbols \(\alpha, \beta, \gamma, \dots\) denote possible parameters associated with one or several of the individual functions \(\{ f_i \}_i\); a parameter may be shared across functions. The input \(x\) as well as possible parameters can all formally be treated as leaves of the DAG.
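As a concrete sketch of this bookkeeping (illustrative Python, not any particular framework's API): internal nodes hold an operation and references to their child nodes, leaves hold a variable or a parameter, and a shared parameter is just the same leaf object referenced from several nodes.

```python
# Minimal sketch of a computational DAG: internal nodes are operations,
# leaves are variables (inputs) or parameters; a shared parameter is the
# same Leaf object referenced by several nodes. Names are illustrative.

class Leaf:
    def __init__(self, name, value=None):
        self.name = name      # e.g. "x" or "alpha"
        self.value = value    # numeric value supplied before evaluation

class Node:
    def __init__(self, op, children):
        self.op = op              # callable: child values -> node value
        self.children = children  # list of Node or Leaf objects

    def evaluate(self):
        vals = [c.value if isinstance(c, Leaf) else c.evaluate()
                for c in self.children]
        return self.op(*vals)

# Example: f0(f1(x; alpha), f2(x; alpha)) with alpha shared by f1 and f2.
x, alpha = Leaf("x", 2.0), Leaf("alpha", 3.0)
f1 = Node(lambda xv, a: a * xv, [x, alpha])
f2 = Node(lambda xv, a: xv + a, [x, alpha])
f0 = Node(lambda u, v: u * v, [f1, f2])
print(f0.evaluate())   # (3*2) * (2+3) = 30.0
```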
An optimization problem for \(f\) wrt. any selected leaves (input \(x\) and/or parameters, typically just parameters) can be specified as
\[\min_{l \in \textrm{leaves}} f\]
If minimization is attempted by gradient descent, the chain rule from calculus gives the required derivatives in terms of those of the node functions. The recursive chain rule definition for the derivative of the \(i\)-th node function wrt. a given parameter \(\alpha\) is
\[\frac{d{f_{i}}}{d{\alpha}} = \frac{\partial{f_{i}}}{\partial{\alpha}} + \sum_{j \in \textrm{children}(i)} \frac{\partial{f_i}}{\partial{f_j}} \frac{d{f_{j}}}{d{\alpha}}\]
Formally there is a computational graph (DAG) of node functions, with the input \(x\) (‘variables’) and the parameters as leaf nodes, and directed edges from parent to child computations. A gradient descent step wrt. \(\alpha\)
\[\alpha' = \alpha - \epsilon \frac{d{f}}{d{\alpha}}\]
requires two operations
1) Determination of the corresponding node inputs for a given leaf input \(x\) by evaluating node functions in the direction of the graph (from leaves to root).
2) Computation of the parameter derivative \(\frac{d{f}}{d{\alpha}}\) by working backwards from the root, that is, unpacking the recursive chain rule definition and evaluating all such derivatives at the corresponding points determined in 1).
Step 1) is called ‘forward-propagation’, step 2) ‘backpropagation’; taken together they define the ‘backpropagation algorithm’, enabling minimization of a composed function by way of gradient descent.
A serial implementation can compute node operations in any order that preserves the computational graph dependencies; any such order is a topological sort of the graph.
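A minimal sketch of steps 1) and 2) under the conventions above (the node interface here is an assumption for illustration, not a library API): the forward pass evaluates nodes in topological order from leaves to root, and the backward pass accumulates \(\frac{df}{dl}\) into every node and leaf by unpacking the recursive chain rule, each node supplying its local partials \(\frac{\partial f_i}{\partial f_j}\).

```python
# Sketch: forward pass in topological order, then reverse-mode accumulation
# of df/d(leaf) via the recursive chain rule. Illustrative names only.

class Node:
    def __init__(self, op=None, partials=None, inputs=(), name=""):
        self.op = op              # callable on input values; None for a leaf
        self.partials = partials  # callable returning local d(node)/d(input_j)
        self.inputs = list(inputs)
        self.name = name
        self.value = None
        self.grad = 0.0           # accumulates df/d(this node)

def topo_order(root):
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for c in n.inputs:
            visit(c)
        order.append(n)          # children before parents: leaves -> root
    visit(root)
    return order

def forward_backward(root):
    order = topo_order(root)
    for n in order:              # 1) forward propagation
        if n.op is not None:
            n.value = n.op(*[c.value for c in n.inputs])
    root.grad = 1.0              # df/df = 1
    for n in reversed(order):    # 2) backpropagation, root towards leaves
        if n.op is None:
            continue
        local = n.partials(*[c.value for c in n.inputs])
        for c, d in zip(n.inputs, local):
            c.grad += n.grad * d   # chain rule: df/dc += df/dn * dn/dc

# Example: f = (alpha * x) ** 2
x = Node(name="x"); x.value = 2.0
alpha = Node(name="alpha"); alpha.value = 3.0
y = Node(op=lambda a, xv: a * xv, partials=lambda a, xv: (xv, a), inputs=[alpha, x])
f = Node(op=lambda yv: yv ** 2, partials=lambda yv: (2 * yv,), inputs=[y])
forward_backward(f)
print(alpha.grad)   # 2 * (alpha*x) * x = 24.0
```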
Some examples. A composition \(f \circ g\)
f ----- g
has a computational graph for variables \(x,y,z\), parameters \(\alpha, \beta\) and operations \(f,g\) satisfying \(y = g(x; \alpha)\), \(z = f(y; \beta)\)
 beta         alpha
   |            |
   | f          | g
   |            |
   z ---------- y ---------- x
          f            g
Parameter derivatives are, by the chain rule,
\(\frac{df}{d \beta} = \frac{\partial f}{\partial \beta}\)
\(\frac{df}{d \alpha} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial \alpha}\)
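Both expressions can be checked symbolically for a concrete (purely illustrative) choice of the node functions, e.g. \(g(x;\alpha)=\alpha x\) and \(f(y;\beta)=\beta y^2\), assuming sympy is available:

```python
# Symbolic check of the two parameter derivatives for an illustrative choice
# g(x; alpha) = alpha*x, f(y; beta) = beta*y**2.
import sympy as sp

x, y, alpha, beta = sp.symbols("x y alpha beta")
g = alpha * x                  # y = g(x; alpha)
f = beta * y**2                # z = f(y; beta), written in the symbol y
z = f.subs(y, g)               # the composition f(g(x; alpha); beta)

# df/dbeta = partial f / partial beta
assert sp.simplify(sp.diff(z, beta) - sp.diff(f, beta).subs(y, g)) == 0
# df/dalpha = (partial f / partial g) * (partial g / partial alpha)
assert sp.simplify(sp.diff(z, alpha)
                   - (sp.diff(f, y) * sp.diff(g, alpha)).subs(y, g)) == 0
```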
Supposing instead a shared parameter, \(\alpha = \beta\), the graph is
         alpha
         /   \
        /     \
       /       \
   z ---------- y ---------- x
          f            g
and so \(\frac{df}{d \alpha} = \frac{\partial f}{\partial \alpha} + \frac{\partial f}{\partial g} \frac{\partial g}{\partial \alpha}\)
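The same symbolic check, continued for the shared-parameter case, picks up the extra direct term \(\frac{\partial f}{\partial \alpha}\) (again an illustrative choice of \(f\) and \(g\)):

```python
# Shared parameter: g(x; alpha) = alpha*x, f(y; alpha) = alpha*y**2.
import sympy as sp

x, y, alpha = sp.symbols("x y alpha")
g = alpha * x
f = alpha * y**2
z = f.subs(y, g)               # z = alpha**3 * x**2

# df/dalpha = partial f/partial alpha + (partial f/partial g)*(partial g/partial alpha)
total = sp.diff(f, alpha).subs(y, g) + (sp.diff(f, y) * sp.diff(g, alpha)).subs(y, g)
assert sp.simplify(sp.diff(z, alpha) - total) == 0   # both equal 3*alpha**2*x**2
```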
Two approaches to backpropagation
i) ‘symbol-to-number’
given computational graph and leaf input \(x\), compute relevant quantities (function values, derivatives) at \(x\) directly.
ii) ‘symbol-to-symbol’
return additional computational graph(s) representing the specified operation(s) at the symbol \(x\) (function values, and derivatives given via the chain rule as sums/products of other derivatives, themselves expressed as functions). Finally handle all these graphs as generic computational graphs on input \(x\).
A benefit of method ii) is implicit support for recursion: higher-order derivative graphs can be generated.
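A small sketch of the contrast, using sympy as a stand-in for a symbol-to-symbol engine (an assumption for illustration): in ii) the derivative is itself an expression/graph, so the construction recurses naturally to higher orders, whereas a symbol-to-number routine only hands back the evaluated number.

```python
# i) symbol-to-number vs ii) symbol-to-symbol, sketched with sympy.
import sympy as sp

x = sp.symbols("x")
f = sp.exp(x) * sp.sin(x)

# i) the number a symbol-to-number pass would produce at x = 1.0
#    (obtained here via sympy purely for comparison)
df_at_1 = sp.diff(f, x).evalf(subs={x: 1.0})

# ii) symbol-to-symbol: the derivative is a new expression (another graph) ...
df = sp.diff(f, x)        # exp(x)*sin(x) + exp(x)*cos(x)
# ... so recursion is free: differentiate the derivative graph again
d2f = sp.diff(df, x)      # 2*exp(x)*cos(x)
print(df_at_1, df, d2f)
```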
In the computational graph model, introduce variables dz/dy, dy/dx, dz/dx (representing pointwise derivative evaluation) connected to their inputs by their respective derivative functions
dz/dy --------- y
           f'
dy/dx --------- x
           g'
dz/dx --------- x
      (f comp g)'
From the chain rule \((f \circ g)' = f' g'\) (with \(f'\) evaluated at \(y = g(x)\)), the variables dz/dy, dy/dx, dz/dx are related in the computational graph by multiplication, so an entire computational graph for variables and first-order derivatives is (*parameter derivatives still need to be included in this graph - tbd)
   z --------- y --------- x
         f     |     g     |
               | f'        | g'
               |           |
             dz/dy       dy/dx
                 \         /
                  \  mult /
                   \     /
                    dz/dx
There are 5 operations and 6 variables in this first-order graph. The construction generalizes to higher-order derivatives.
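A sketch that builds this first-order graph explicitly (the `Var` class and the choice \(g(x)=x^2\), \(f(y)=\sin y\) are illustrative assumptions): the derivative quantities dz/dy, dy/dx, dz/dx are ordinary variables produced by the operations f', g' and mult, and counting confirms 5 operations and 6 variables.

```python
import math

# Build the first-order graph for z = f(g(x)) explicitly, with the derivative
# quantities as ordinary variables in the same graph. Illustrative choice:
# g(x) = x**2 (so g'(x) = 2x) and f(y) = sin(y) (so f'(y) = cos(y)).

class Var:
    def __init__(self, name, op=None, inputs=()):
        self.name, self.op, self.inputs = name, op, list(inputs)

    def eval(self, env):                 # env holds leaf values, e.g. {"x": ...}
        if self.op is None:
            return env[self.name]
        return self.op(*[v.eval(env) for v in self.inputs])

x     = Var("x")
y     = Var("y",     op=lambda xv: xv**2,     inputs=[x])              # g
z     = Var("z",     op=math.sin,             inputs=[y])              # f
dz_dy = Var("dz/dy", op=math.cos,             inputs=[y])              # f'
dy_dx = Var("dy/dx", op=lambda xv: 2 * xv,    inputs=[x])              # g'
dz_dx = Var("dz/dx", op=lambda a, b: a * b,   inputs=[dz_dy, dy_dx])   # mult

variables  = [x, y, z, dz_dy, dy_dx, dz_dx]                  # 6 variables
operations = [v for v in variables if v.op is not None]      # 5 operations
print(len(variables), len(operations))                       # 6 5
print(dz_dx.eval({"x": 0.5}))   # cos(0.25) * 1.0 = (f comp g)'(0.5)
```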
A computational graph, in summary, consists of variables and operations; given specific numerical values at nodes, the corresponding parts of the graph can be evaluated.
A typical function to learn is a loss function that is additive (and hence independent) over a per-sample function, \(l = \frac{1}{N} \sum_i l_0(\tilde{x}_i)\). Since with analytical node gradient expressions it is guaranteed that \(\nabla_{\theta} \{ \frac{1}{N} \sum_i l_0(\tilde{x}_i) \} = \frac{1}{N} \sum_i \nabla_{\theta} l_0(\tilde{x}_i)\), only a single graph need be generated for \(l_0\), and the backpropagation step for each \(l_0(\tilde{x}_i)\) can be performed in parallel (a sufficient sample of weight steps is then summed appropriately according to either asynchronous or synchronous SGD).
A textbook example is \(l = \frac{1}{N} \sum_i (w\tilde{x}_i + b - \tilde{y}_i)^2\). A possible computational graph for \(l_0\) could be built from elementary nodes for matrix multiplication, addition, pair sum, and squaring.
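A quick numeric check of the gradient-of-the-mean point on this textbook loss, for scalar \(w, b\) and arbitrary data (numpy assumed):

```python
# Check that the gradient of the averaged loss equals the average of the
# per-sample gradients for l = (1/N) * sum_i (w*x_i + b - y_i)**2.
import numpy as np

rng = np.random.default_rng(0)
x_t, y_t = rng.normal(size=8), rng.normal(size=8)   # arbitrary sample
w, b = 0.7, -0.2

residual = w * x_t + b - y_t
# per-sample gradients of l0_i = (w*x_i + b - y_i)**2
grad_w_i = 2 * residual * x_t
grad_b_i = 2 * residual

# gradient of the mean loss, written out directly
grad_w = (2 / len(x_t)) * np.sum(residual * x_t)
grad_b = (2 / len(x_t)) * np.sum(residual)

assert np.isclose(grad_w, grad_w_i.mean())
assert np.isclose(grad_b, grad_b_i.mean())
```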
The particular granularity of the elementary nodes, and hence the depth of the computational graph, is a matter of choice. From one point of view, computational graph depth is roughly analogous to the number of sequential instructions in a program; from another, model depth exceeds computational graph depth because knowledge of more complex concepts refines that of simpler concepts (the deep probabilistic models view).