The next in the series of networks that popularized Convolutional Neural Networks and Deep Learning is the seminal AlexNet. In 2012, AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). To quote from the paper's abstract:

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

Some of the key learnings from this implementation are:

1. ReLU Activation
Up until AlexNet, the commonly used activation was tanh. ReLU, which computes $f(x) = \max(0, x)$, does not saturate for positive inputs, so convergence was found to be much faster, and it has since become the standard practice.
2. Local Response Normalization
This is a technique used to highlight neurons with relatively large activations. Since ReLU has an unbounded output, normalizing it also helps in faster convergence. When applied to a uniformly activated neighbourhood, all activations are inhibited equally, while when applied in the neighbourhood of a strongly excited neuron, that neuron's response stands out even more, which improves the contrast. The exact normalization formula from the paper is reproduced after this list.
3. Overlapping Pooling
Pooling with a stride smaller than the kernel size is called overlapping pooling, while a stride equal to the kernel size gives non-overlapping pooling. These pooling layers reduce the width and height of the tensor while maintaining the depth; for example, the 3x3 pooling with stride 2 used here turns a 57x57x96 tensor into a 28x28x96 one, since $\lfloor (57 - 3)/2 \rfloor + 1 = 28$. Overlapping pooling was shown to reduce the error in this architecture.
4. Data Augmentation
Even with a dataset as huge as ImageNet, the large number of parameters in the model makes it prone to overfitting, so data augmentation was needed to artificially increase the size of the dataset. The augmentation happens on the CPU in batches while the GPU is training on the previous batch, so it is, in effect, computationally free.
Several 227 x 227 patches are extracted from the 256 x 256 input image and their horizontal reflections are taken as well. Patch extraction increases the dataset by a factor of $(256 - 227)^2 = 841$, and the horizontal flips double that to $841 \times 2$; this helps in reducing overfitting. A rough sketch of the crop-and-flip scheme is shown after this list. The image intensities are also altered (a PCA-based jitter of the RGB channels), a concept that was esoteric to me. Perhaps a later write-up on that.
5. Dropout
Another new concept introduced with this architecture. During training, each neuron has a 0.5 probability of not contributing to the forward and backward pass. This prevents the network from relying on any one dominant neuron and effectively makes it act as an average of several different networks, thus reducing overfitting. A minimal sketch of the idea also follows this list.
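
For reference, the response normalization described in point 2 computes the normalized activity $b^i_{x,y}$ from the raw ReLU output $a^i_{x,y}$ of kernel $i$ at position $(x, y)$ by summing over $n$ adjacent kernel maps at the same spatial position:

$$ b^i_{x,y} = a^i_{x,y} \Bigg/ \left( k + \alpha \sum_{j=\max(0,\, i - n/2)}^{\min(N-1,\, i + n/2)} \left( a^j_{x,y} \right)^2 \right)^{\beta} $$

where $N$ is the total number of kernels in the layer; the paper uses $k = 2$, $n = 5$, $\alpha = 10^{-4}$ and $\beta = 0.75$.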
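
As a rough sketch of the crop-and-flip augmentation from point 4 (my own illustration in NumPy, not the original pipeline; the function name and shapes are assumptions):

import numpy as np

def random_crop_and_flip(image, crop=227):
    # image is an H x W x C array, e.g. a 256 x 256 x 3 ImageNet sample
    h, w = image.shape[:2]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    # mirror the patch horizontally half of the time
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]
    return patch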
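
And a minimal sketch of dropout from point 5, written in the "inverted" form most frameworks use today (the original paper instead halves the activations at test time; same idea, different bookkeeping):

import numpy as np

def dropout_train(activations, p=0.5):
    # zero each neuron with probability p and rescale the survivors so the
    # expected activation matches test time, when dropout is switched off
    mask = np.random.rand(*activations.shape) >= p
    return activations * mask / (1.0 - p)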

The architecture of the network is summarized in the table below.

| Layer Type | Output Size | Filter Size / Stride |
| --- | --- | --- |
| INPUT IMAGE | 227x227x3 | |
| CONV | 57x57x96 | 11x11 / 4x4, K = 96 |
| ACT | 57x57x96 | |
| BN | 57x57x96 | |
| POOL | 28x28x96 | 3x3 / 2x2 |
| DROPOUT | 28x28x96 | |
| CONV | 28x28x256 | 5x5, K = 256 |
| ACT | 28x28x256 | |
| BN | 28x28x256 | |
| POOL | 13x13x256 | 3x3 / 2x2 |
| DROPOUT | 13x13x256 | |
| CONV | 13x13x384 | 3x3, K = 384 |
| ACT | 13x13x384 | |
| BN | 13x13x384 | |
| CONV | 13x13x384 | 3x3, K = 384 |
| ACT | 13x13x384 | |
| BN | 13x13x384 | |
| CONV | 13x13x256 | 3x3, K = 256 |
| ACT | 13x13x256 | |
| BN | 13x13x256 | |
| POOL | 6x6x256 | 3x3 / 2x2 |
| DROPOUT | 6x6x256 | |
| FC | 4096 | |
| ACT | 4096 | |
| BN | 4096 | |
| DROPOUT | 4096 | |
| FC | 4096 | |
| ACT | 4096 | |
| BN | 4096 | |
| DROPOUT | 4096 | |
| FC | 1000 | |
| SOFTMAX | 1000 | |

The network consists of 5 convolutional layers and 3 fully connected layers, leading to about 60 million parameters to learn.
A Keras implementation (with the required imports) would be as below:

from keras.models import Sequential
from keras.layers import Conv2D, Activation, BatchNormalization, MaxPooling2D, Dropout, Flatten, Dense
from keras.regularizers import l2
from keras import backend as K

class AlexNet:
    @staticmethod
    def build(width, height, depth, classes, reg=0.0002):
        # initialize the model along with a "channels last" input shape
        model = Sequential()
        inputShape = (height, width, depth)
        chanDim = -1
        # if we are using "channels first", update the input shape and channels dimension
        if K.image_data_format() == "channels_first":
            inputShape = (depth, height, width)
            chanDim = 1

        # Block #1: first CONV => RELU => POOL layer set
        model.add(Conv2D(96, (11, 11), strides=(4, 4), input_shape=inputShape, padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Dropout(0.25))

        # Block #2: second CONV => RELU => POOL layer set
        model.add(Conv2D(256, (5, 5), padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Dropout(0.25))

        # Block #3: CONV => RELU => CONV => RELU => CONV => RELU => POOL
        model.add(Conv2D(384, (3, 3), padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(Conv2D(384, (3, 3), padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(Conv2D(256, (3, 3), padding="same", kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Dropout(0.25))
        # Block #4: first set of FC => RELU layers
        model.add(Flatten())
        model.add(Dense(4096, kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))
        # Block #5: second set of FC => RELU layers
        model.add(Dense(4096, kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))
        # softmax classifier
        model.add(Dense(classes, kernel_regularizer=l2(reg)))
        model.add(Activation("softmax"))
        return model
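
A quick sanity check (assuming the imports above) is to build the network with the ImageNet input size and class count and look at the parameter count:

model = AlexNet.build(width=227, height=227, depth=3, classes=1000)
model.summary()  # the total should come out to roughly 60 million parameters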