Convolutions

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Convolution

\[s_i = w_l x_{i-l}\]

a bilinear operation in \(x\) and \(w\), defined over all integer indices, with summation over the repeated index \(l\) implied. By convention \(x\) and \(w\) have non-zero values beginning at the 0th index. The largest index of \(w\) with a non-zero value, plus one, defines the kernel size, giving the longest correlation length between \(s\) and \(x\).

Convolution is a constrained linear operation (dignified as a ‘Toeplitz matrix’) and generalizes to multiple dimensions, \(s_{ijk \ldots} = A_{lmn \ldots} x_{i-l, j-m, k-n, \ldots}\). Some dimensions may be general linear transformations (‘channels’ in image processing).
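
As a concrete check of the Toeplitz view, a minimal NumPy sketch building \(K_{im} = w_{i-m}\) explicitly and comparing it against a direct convolution (the example kernel and signal are illustrative):

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])                 # kernel, size k = 3
x = np.array([1.0, 0.0, 2.0, 1.0, 0.0, 1.0])  # signal

# Banded (Toeplitz) matrix K_{im} = w_{i-m}, zero outside 0 <= i-m < k.
n_out = len(x) + len(w) - 1
K = np.zeros((n_out, len(x)))
for i in range(n_out):
    for m in range(len(x)):
        if 0 <= i - m < len(w):
            K[i, m] = w[i - m]

assert np.allclose(K @ x, np.convolve(x, w))  # matches the direct convolution
```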

Determined by translation invariance under the Lie group \(\mathbb{R}^n\). Given \(T\) translation by \(k\) (a group element of \(\mathbb{R}^n\)) and \(K\) the convolution operator

\[T[x]_i := x_{i+k}\] \[K[x]_i := w_l x_{i-l}\]

Invariance under group action requires

\[[T, K] = 0\] \[K_{i, m-k} = K_{i+k, m}\] \[K_{im} = \xi(i-m)\]

for some function \(\xi\). So \(K\) is the convolution operation. Can apply same logic with other symmetry groups.
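
A numerical sketch of \([T, K] = 0\), assuming a periodic domain so that translation is exact (np.roll stands in for \(T\); kernel, signal, and shift are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)
w = rng.normal(size=5)

def K(x):
    """Circular convolution K[x]_i = sum_l w_l x_{(i-l) mod N}."""
    N = len(x)
    return np.array([sum(w[l] * x[(i - l) % N] for l in range(len(w)))
                     for i in range(N)])

def T(x, k=3):
    """Translation T[x]_i = x_{i+k} (periodic)."""
    return np.roll(x, -k)

assert np.allclose(T(K(x)), K(T(x)))   # [T, K] = 0
```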

Translation invariance example: object in image detected regardless of location in scene.

Neural network lingo

Dilation, stride, and tiling are common structural variants of the convolution kernel:

\(s_i = x_{\lambda i - l} w_{l}\) (stride \(\lambda\))

\(s_i = x_{i - \alpha l} w_{l}\) (dilation \(\alpha\))

\(s_i = x_{i-l} w_{l\bmod\beta}\) (tiling \(\beta\))

A dilation reads every \(\alpha\)-th input under the kernel, and a stride advances the kernel by \(\lambda\) inputs per output, keeping every \(\lambda\)-th output. Tiling makes the weights periodic along the kernel with period \(\beta\); from its definition \(\beta\) must be less than the kernel size to have any effect (\(\beta \ge k\) recovers an ordinary convolution).
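
A loop-based NumPy sketch of the three variants as index arithmetic; the function name, the output-index offset, and the valid-range convention are illustrative choices, not fixed by the definitions above:

```python
import numpy as np

def conv_variant(x, w, stride=1, dilation=1, tile=None):
    """s_i = sum_l w_{l mod tile} x_{stride*i + span - dilation*l}, valid outputs only."""
    k = len(w)
    beta = k if tile is None else tile     # tile >= k leaves the weights untouched
    span = dilation * (k - 1)              # receptive field length minus one
    n_out = (len(x) - 1 - span) // stride + 1
    return np.array([sum(w[l % beta] * x[stride * i + span - dilation * l]
                         for l in range(k))
                     for i in range(n_out)])

x = np.arange(10.0)
w = np.array([1.0, -1.0, 2.0])

# stride = dilation = 1 with tiling off reduces to an ordinary 'valid' convolution
assert np.allclose(conv_variant(x, w), np.convolve(x, w, mode="valid"))

s_strided = conv_variant(x, w, stride=2)    # keeps every 2nd output position
s_dilated = conv_variant(x, w, dilation=2)  # reads every 2nd input under the kernel
s_tiled   = conv_variant(x, w, tile=2)      # weights repeat with period 2: w0, w1, w0
```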

Composition of convolutions is a convolution (\(K_1 K_2\) commutes with \(T\) when \(K_1\) and \(K_2\) each commute with \(T\)). Kernels of sizes \(k_1\) and \(k_2\) compose to a kernel of size \(k_1 + k_2 - 1\), so the composed effective receptive field is no greater than the sum of the individual effective receptive fields.
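
This is easy to check numerically: by associativity, applying two convolutions in sequence equals a single convolution whose kernel is the convolution of the two kernels (a NumPy sketch with illustrative random kernels):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=32)
w1, w2 = rng.normal(size=3), rng.normal(size=5)

# Applying K_2 then K_1 equals one convolution with kernel w1 * w2,
# of size 3 + 5 - 1 = 7, by associativity of convolution.
assert np.allclose(np.convolve(np.convolve(x, w2), w1),
                   np.convolve(x, np.convolve(w1, w2)))
assert len(np.convolve(w1, w2)) == len(w1) + len(w2) - 1
```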

Dilation \(\alpha\) creates receptive field length \(\alpha (k - 1) + 1\) for kernel size \(k\). Growing the dilation by a factor \(n\) per layer, a stack of convolutions with dilations \(\alpha, \alpha n, \alpha n^2, \ldots, \alpha n^l\) has an effective receptive field of size

\[1 + \alpha (k - 1)(1 + n + n^2 + \cdots + n^l) = 1 + \alpha (k - 1) \frac{n^{l+1}-1}{n-1} \sim \mathcal{O}(\alpha k n^l) \sim \mathcal{O}(n^l)\]

exponential in the number of layers \(l\) for a fixed dilation growth factor \(n\). For example WaveNet sets \(n = 2\).
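
An impulse-response sketch of the geometric growth, using ones-kernels so the support of the response equals the receptive field exactly; the WaveNet-like choice \(k = 2\), \(\alpha = 1\), \(n = 2\) is illustrative:

```python
import numpy as np

def dilate(w, d):
    """Insert d-1 zeros between taps: kernel size becomes d*(k-1) + 1."""
    wd = np.zeros(d * (len(w) - 1) + 1)
    wd[::d] = w
    return wd

k, alpha, n, l = 2, 1, 2, 5                # dilations 1, 2, 4, 8, 16, 32
x = np.zeros(200); x[100] = 1.0            # unit impulse
for j in range(l + 1):
    x = np.convolve(x, dilate(np.ones(k), alpha * n**j))

measured = np.count_nonzero(x)             # support of the impulse response
predicted = 1 + alpha * (k - 1) * (n**(l + 1) - 1) // (n - 1)
assert measured == predicted == 64
```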