Arithmetic. So far, we have covered convolutional and max-pooling layers as well as stride and padding. Putting these together, you can convince yourself that the output spatial dimension of any such layer can be calculated with the formula:
(W - F + 2P) / S + 1
where W is the input width (or height), F is the filter size, P is the padding, and S is the stride.
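As a quick sanity check, the formula can be written as a small helper function (a minimal sketch; the function name is my own):

```python
def conv_output_size(w, f, p, s):
    """Output spatial size of a conv or pooling layer.

    w = input size, f = filter size, p = padding, s = stride.
    """
    out = (w - f + 2 * p) / s + 1
    # The filter, stride, and padding must tile the input evenly.
    assert out.is_integer(), "hyperparameters do not produce an integer output size"
    return int(out)

# A 32 x 32 input with a 5 x 5 filter, padding 2, and stride 1
# keeps its spatial size: (32 - 5 + 4)/1 + 1 = 32.
print(conv_output_size(32, 5, 2, 1))  # 32
```

Note that not every combination of hyperparameters is valid: if the division does not come out to an integer, the filter cannot tile the input cleanly.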
Example. Now that we have learned all the important components of CNNs, we can tie it all together using AlexNet as an example.
AlexNet is a CNN-based model that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, beating the next best model by more than 10% in top-5 error (i.e. the fraction of images for which the correct class is not among the model's top five predictions). This model with 5 convolutional layers and 3 fully-connected layers was not only state-of-the-art at the time, but also single-handedly ushered in today's wave of deep learning in computer vision research. Fun stuff!
AlexNet takes in an image of size 227 x 227 x 3 and applies 96 filters of size 11 x 11 with stride 4 to get a new volume of 55 x 55 x 96. Overlapping max pooling of size 3 and stride 2 is then applied to produce 27 x 27 x 96 (by the formula, (55 - 3)/2 + 1 = 27, so the arithmetic works out exactly). After a few more layers, we get a volume of size 13 x 13 x 256. This output is flattened and connected to 2 fully-connected layers of 4096 neurons each and a final FC layer of 1000 neurons, whose softmax outputs represent the probabilities of the 1000 ImageNet classes.
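The sizes above can be verified with the same formula. A minimal sketch tracing only the first two layers (note that the original AlexNet uses overlapping 3 x 3 max pooling with stride 2, which makes the arithmetic exact):

```python
def out_size(w, f, p, s):
    # (W - F + 2P) / S + 1, assumed to divide evenly
    return (w - f + 2 * p) // s + 1

w = 227                     # input image: 227 x 227 x 3
w = out_size(w, 11, 0, 4)   # conv1: 96 filters of 11 x 11, stride 4 -> 55
w = out_size(w, 3, 0, 2)    # pool1: 3 x 3 max pooling, stride 2 -> 27
print(w)  # 27
```

Tracing the remaining layers the same way eventually yields the 13 x 13 x 256 volume that feeds the fully-connected layers.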
As a fun tidbit, the original paper claims the input images were 224 x 224, but the math simply does not work out. This has been one of the small mysteries of CNN history; the best guess is that the images were zero-padded with 3 pixels and the authors forgot to mention it.
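The mismatch is easy to check with the same arithmetic (assuming no padding, as the paper describes for the first layer):

```python
# First conv layer: 11 x 11 filters, stride 4, no padding.
for w in (224, 227):
    out = (w - 11) / 4 + 1
    print(w, "->", out)
# 224 -> 54.25  (not an integer, so 224 cannot be right as stated)
# 227 -> 55.0   (matches the 55 x 55 x 96 volume)
```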