Convolutions

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Convolution

\[s_i = w_l x_{i-l}\]

a bilinear operation in \(x\) and \(w\), defined over all integer indices, with summation over the repeated index \(l\) implied. By convention \(x\) and \(w\) have non-zero values beginning at the 0th index. The largest index of \(w\) with a non-zero value, plus one, defines the kernel size, giving the longest correlation length between \(s\) and \(x\).

Convolution is a constrained linear operation (dignified as a ‘Toeplitz matrix’) and generalizes to multiple dimensions, \(s_{ijk \ldots} = A_{lmn \ldots} x_{i-l, j-m, k-n, \ldots}\). Some dimensions may be general linear transformations (‘channels’ in image processing).
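
As a concrete check of the Toeplitz view, a minimal NumPy sketch building \(K_{im} = w_{i-m}\) explicitly and comparing it against a direct convolution (the example kernel and signal are illustrative):

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])                 # kernel, size k = 3
x = np.array([1.0, 0.0, 2.0, 1.0, 0.0, 1.0])  # signal

# Banded (Toeplitz) matrix K_{im} = w_{i-m}, zero outside 0 <= i-m < k.
n_out = len(x) + len(w) - 1
K = np.zeros((n_out, len(x)))
for i in range(n_out):
    for m in range(len(x)):
        if 0 <= i - m < len(w):
            K[i, m] = w[i - m]

assert np.allclose(K @ x, np.convolve(x, w))  # matches the direct convolution
```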

Determined by translation invariance under the Lie group \(\mathbb{R}^n\). Given \(T\) translation by \(k\) (a group element of \(\mathbb{R}^n\)) and \(K\) the convolution operator

\[T[x]_i := x_{i+k}\] \[K[x]_i := w_l x_{i-l}\]

Invariance under group action requires

\[[T, K] = 0\] \[K_{i, m-k} = K_{i+k, m}\] \[K_{im} = \xi(i-m)\]

for some function \(\xi\). So \(K\) is the convolution operation. Can apply same logic with other symmetry groups.
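
A numerical sketch of \([T, K] = 0\), assuming a periodic domain so that translation is exact (np.roll stands in for \(T\); kernel, signal, and shift are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)
w = rng.normal(size=5)

def K(x):
    """Circular convolution K[x]_i = sum_l w_l x_{(i-l) mod N}."""
    N = len(x)
    return np.array([sum(w[l] * x[(i - l) % N] for l in range(len(w)))
                     for i in range(N)])

def T(x, k=3):
    """Translation T[x]_i = x_{i+k} (periodic)."""
    return np.roll(x, -k)

assert np.allclose(T(K(x)), K(T(x)))   # [T, K] = 0
```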

Translation invariance example: object in image detected regardless of location in scene.

Neural network lingo

Dilation, stride, and tiling are common structural variants of the convolution kernel:

\(s_i = x_{\lambda i - l} w_{l}\) (stride \(\lambda\))

\(s_i = x_{i - \alpha l} w_{l}\) (dilation \(\alpha\))

\(s_i = x_{i-l} w_{l\bmod\beta}\) (tiling \(\beta\))

A dilation reads every \(\alpha\)-th input under the kernel, and a stride advances the kernel by \(\lambda\) inputs per output, keeping every \(\lambda\)-th output. Tiling makes the weights periodic along the kernel with period \(\beta\); from its definition \(\beta\) must be less than the kernel size to have any effect (\(\beta \ge k\) recovers an ordinary convolution).
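
A loop-based NumPy sketch of the three variants as index arithmetic; the function name, the output-index offset, and the valid-range convention are illustrative choices, not fixed by the definitions above:

```python
import numpy as np

def conv_variant(x, w, stride=1, dilation=1, tile=None):
    """s_i = sum_l w_{l mod tile} x_{stride*i + span - dilation*l}, valid outputs only."""
    k = len(w)
    beta = k if tile is None else tile     # tile >= k leaves the weights untouched
    span = dilation * (k - 1)              # receptive field length minus one
    n_out = (len(x) - 1 - span) // stride + 1
    return np.array([sum(w[l % beta] * x[stride * i + span - dilation * l]
                         for l in range(k))
                     for i in range(n_out)])

x = np.arange(10.0)
w = np.array([1.0, -1.0, 2.0])

# stride = dilation = 1 with tiling off reduces to an ordinary 'valid' convolution
assert np.allclose(conv_variant(x, w), np.convolve(x, w, mode="valid"))

s_strided = conv_variant(x, w, stride=2)    # keeps every 2nd output position
s_dilated = conv_variant(x, w, dilation=2)  # reads every 2nd input under the kernel
s_tiled   = conv_variant(x, w, tile=2)      # weights repeat with period 2: w0, w1, w0
```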

Composition of convolutions is a convolution (\(K_1 K_2\) commutes with \(T\) when \(K_1\) and \(K_2\) each commute with \(T\)). Kernels of sizes \(k_1\) and \(k_2\) compose to a kernel of size \(k_1 + k_2 - 1\), so the composed effective receptive field is no greater than the sum of the individual effective receptive fields.
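
This is easy to check numerically: by associativity, applying two convolutions in sequence equals a single convolution whose kernel is the convolution of the two kernels (a NumPy sketch with illustrative random kernels):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=32)
w1, w2 = rng.normal(size=3), rng.normal(size=5)

# Applying K_2 then K_1 equals one convolution with kernel w1 * w2,
# of size 3 + 5 - 1 = 7, by associativity of convolution.
assert np.allclose(np.convolve(np.convolve(x, w2), w1),
                   np.convolve(x, np.convolve(w1, w2)))
assert len(np.convolve(w1, w2)) == len(w1) + len(w2) - 1
```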

Dilation \(\alpha\) creates receptive field length \(\alpha (k - 1) + 1\) for kernel size \(k\). Growing the dilation by a factor \(n\) per layer, a stack of convolutions with dilations \(\alpha, \alpha n, \alpha n^2, \ldots, \alpha n^l\) has an effective receptive field of size

\[1 + \alpha (k - 1)(1 + n + n^2 + \cdots + n^l) = 1 + \alpha (k - 1) \frac{n^{l+1}-1}{n-1} \sim \mathcal{O}(\alpha k n^l) \sim \mathcal{O}(n^l)\]

exponential in the number of layers \(l\) for a fixed dilation growth factor \(n\). For example WaveNet sets \(n = 2\).
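
An impulse-response sketch of the geometric growth, using ones-kernels so the support of the response equals the receptive field exactly; the WaveNet-like choice \(k = 2\), \(\alpha = 1\), \(n = 2\) is illustrative:

```python
import numpy as np

def dilate(w, d):
    """Insert d-1 zeros between taps: kernel size becomes d*(k-1) + 1."""
    wd = np.zeros(d * (len(w) - 1) + 1)
    wd[::d] = w
    return wd

k, alpha, n, l = 2, 1, 2, 5                # dilations 1, 2, 4, 8, 16, 32
x = np.zeros(200); x[100] = 1.0            # unit impulse
for j in range(l + 1):
    x = np.convolve(x, dilate(np.ones(k), alpha * n**j))

measured = np.count_nonzero(x)             # support of the impulse response
predicted = 1 + alpha * (k - 1) * (n**(l + 1) - 1) // (n - 1)
assert measured == predicted == 64
```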