$$\newcommand{\dint}{\mathrm{d}} \newcommand{\vphi}{\boldsymbol{\phi}} \newcommand{\vpi}{\boldsymbol{\pi}} \newcommand{\vpsi}{\boldsymbol{\psi}} \newcommand{\vomg}{\boldsymbol{\omega}} \newcommand{\vsigma}{\boldsymbol{\sigma}} \newcommand{\vzeta}{\boldsymbol{\zeta}} \renewcommand{\vx}{\mathbf{x}} \renewcommand{\vy}{\mathbf{y}} \renewcommand{\vz}{\mathbf{z}} \renewcommand{\vh}{\mathbf{h}} \renewcommand{\b}{\mathbf} \renewcommand{\vec}{\mathrm{vec}} \newcommand{\vecemph}{\mathrm{vec}} \newcommand{\mvn}{\mathcal{MN}} \newcommand{\G}{\mathcal{G}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\S}{\mathcal{S}} \newcommand{\diag}[1]{\mathrm{diag}(#1)} \newcommand{\diagemph}[1]{\mathrm{diag}(#1)} \newcommand{\tr}[1]{\text{tr}(#1)} \renewcommand{\C}{\mathbb{C}} \renewcommand{\R}{\mathbb{R}} \renewcommand{\E}{\mathbb{E}} \newcommand{\D}{\mathcal{D}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerbig}[1]{\left \langle #1 \right \rangle} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\two}{\mathrm{II}} \newcommand{\GL}{\mathrm{GL}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\grad}[1]{\mathrm{grad} \, #1} \newcommand{\gradat}[2]{\mathrm{grad} \, #1 \, \vert_{#2}} \newcommand{\Hess}[1]{\mathrm{Hess} \, #1} \newcommand{\T}{\text{T}} \newcommand{\dim}[1]{\mathrm{dim} \, #1} \newcommand{\partder}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\rank}[1]{\mathrm{rank} \, #1}$$

# Convnet: Implementing Maxpool Layer with Numpy

Traditionally, convnet consists of several layers: convolution, pooling, fully connected, and softmax. Although it’s not true anymore with the recent development. A lot of things going on out there and the architecture of convent has been steadily (r)evolving, something like Google’s Inception module found in GoogLeNet and the recent ImageNet champion: ResNet.

Nevertheless, conv and pool layers are still the essential foundations of convnet. We’ve covered the conv layer in the last post. Now let’s dig into pool layer, especially maxpool layer.

## Pool layer

Whereas conv layer applies filter to the input images, pool layer reduce the dimensionality of the output. Reducing the dimensionality of the images is a necessity in convnet as we’re dealing with a high dimensional data and a lot of filters, which implies that we will have huge amount of parameters in our convnet.

What pool layer does is simple: at each patch of the image, do a summarization operation on it. If we recall, that’s mildly similar to what conv layer does. Whereas at each location we’re taking the dot product in conv layer, we’re going to do simple summarization in a particular image patch.

The summarization operation could be any summary statistics: average, max, min, median, you name it. But, the most widely used operation is the max operation. This, combined with the adjustment of the size, padding, and stride of our image patch will result in some nice properties that are useful for our model.

Those are, mainly, the reduction of the dimensionality = less parameter = less computation burden; and slightly more robust model, because we’re taking “high level views” of our images, the network will be slightly invariant towards small changes like rotation, translation, etc.

## Maxpool layer

Knowing what pool layer does, it’s trivial to think about the summarization operation. For maxpool layer, it’s just taking the maximum value of each image patch.

It’s just the same as conv layer with one exception: max instead of dot product.

## Maxpool forward

As we already know that maxpool layer is similar to conv layer, implementing it is somewhat easier.

That’s it for the forward computation of maxpool layer. However, instead of getting the maximum value directly, we did an intermediate step: getting the maximum index first. This is because the index are useful for the backward computation.

At above example, we could see how maxpool layer will reduce the computation for the subsequent step. Doing 2x2 pooling with stride of 2 and no padding will essentially reduce the image dimension by half.

For example, we have this single MNIST data of 28x28:

After we fed the image to our maxpool layer, the result will look like this:

## Maxpool backward

Recall, how do we compute the gradient for ReLU layer. We let the gradient pass through when the ReLU result is non zero, and otherwise we block the gradient by setting it to zero.

Maxpool layer is similar, because that’s essentially what max operation do in backpropagation.

Recall the ReLU gradient is dX[X <= 0] = 0, as we’re doing max(0, x) in ReLU. We’re basically applying that to our stretched image patches. Only, we start with 0 matrix and put the gradient in the correct location, and we’re taking the max of the image patch, instead of comparing it with 0 like we do in ReLU.

## Conclusion

We see that pool layer, specifically maxpool layer is similar to conv and ReLU layer. It’s similar as conv as we need to stretch our input image with im2col to get the all possible image patches where we are going to take the maximum value over. It’s similar as ReLU as we’re doing the same operation: max.

We also see that doing maxpool with certain parameters, e.g. 2x2 maxpool with stride of 2 and padding of 0 will essentially halve the dimension of the input.