The next network in the series that popularized Convolutional Neural Networks and Deep Learning is the seminal AlexNet, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. To quote from the paper's abstract:

> We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

Some of the key learnings from this implementation are:

**ReLU Activation**

Up until AlexNet, the commonly used activation was tanh. With ReLU, convergence was found to be much faster, and this has since become standard practice.

**Local Response Normalization**

This is a technique used to highlight neurons with relatively large activations. Since ReLU has an unbounded activation, normalizing it also helps with faster convergence. When applied to a uniformly activated neighbourhood, all activations are equally inhibited, while when applied in the neighbourhood of a strongly excited neuron, the relative response becomes even more pronounced, which improves contrast.

**Overlapping Pooling**

A stride smaller than the kernel size gives overlapping pooling, while a stride equal to the kernel size gives non-overlapping pooling. These pooling layers reduce the width and height of the tensor while maintaining the depth. Overlapping pooling was shown to slightly reduce the error rate in this architecture.

**Data Augmentation**

Even with a huge dataset like ImageNet, the large number of parameters in the model makes it prone to overfitting, so data augmentation was needed to artificially increase the size of the dataset. The augmentation happens on the CPU in batches while the GPU is training on the previous batch, so it adds effectively no computational overhead.
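A minimal NumPy sketch of this crop-and-flip augmentation (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def random_crop_flip(image, crop=227):
    """Extract a random crop and optionally reflect it horizontally."""
    h, w, _ = image.shape
    top = np.random.randint(0, h - crop + 1)   # random offset in [0, 256 - 227]
    left = np.random.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:                 # horizontal reflection
        patch = patch[:, ::-1]
    return patch

image = np.zeros((256, 256, 3), dtype=np.uint8)
patch = random_crop_flip(image)
print(patch.shape)  # (227, 227, 3)
```

In training this would be applied on the fly inside the data loader, so each epoch sees a different set of patches.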

Several 227 x 227 patches were extracted from the input 256 x 256 image, along with their horizontal reflections. The dataset is increased by a factor of $(256 - 227)^2 = 841$ through patch extraction and by a further factor of 2 through horizontal flips. This helps in reducing overfitting. The image intensities are also altered via a PCA-based colour perturbation, a concept that was esoteric to me. Perhaps a later write-up on that.

**Dropout**

Dropout is another new concept introduced in this architecture. During training, each neuron has a 0.5 probability of not contributing to the forward and backward passes. This prevents the network from relying on any one dominant neuron and effectively makes it act as an average of several different networks, thus reducing overfitting.
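The idea can be sketched in a few lines of NumPy. Note that this is the "inverted" dropout variant used by modern frameworks, which rescales the surviving activations at training time; the original AlexNet instead scaled the outputs by 0.5 at test time, which is mathematically equivalent in expectation:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: zero each unit with probability p_drop and
    rescale the survivors so the expected activation is unchanged."""
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

x = np.ones((4, 8))
y = dropout(x, p_drop=0.5)   # entries are 0.0 (dropped) or 2.0 (kept, rescaled)
```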

The architecture of the network is shown in the table below.

Layer Type | Output Size | Filter Size / Stride |
---|---|---|
INPUT IMAGE | 227x227x3 | |
CONV | 57x57x96 | 11x11 / 4x4; K = 96 |
ACT | 57x57x96 | |
BN | 57x57x96 | |
POOL | 28x28x96 | 3x3 / 2x2 |
DROPOUT | 28x28x96 | |
CONV | 28x28x256 | 5x5; K = 256 |
ACT | 28x28x256 | |
BN | 28x28x256 | |
POOL | 13x13x256 | 3x3 / 2x2 |
DROPOUT | 13x13x256 | |
CONV | 13x13x384 | 3x3; K = 384 |
ACT | 13x13x384 | |
BN | 13x13x384 | |
CONV | 13x13x384 | 3x3; K = 384 |
ACT | 13x13x384 | |
BN | 13x13x384 | |
CONV | 13x13x256 | 3x3; K = 256 |
ACT | 13x13x256 | |
BN | 13x13x256 | |
POOL | 6x6x256 | 3x3 / 2x2 |
DROPOUT | 6x6x256 | |
FC | 4096 | |
ACT | 4096 | |
BN | 4096 | |
DROPOUT | 4096 | |
FC | 4096 | |
ACT | 4096 | |
BN | 4096 | |
DROPOUT | 4096 | |
FC | 1000 | |
SOFTMAX | 1000 | |

The network consists of 5 convolutional layers and 3 fully connected layers leading to about 60 million parameters to learn.
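The parameter count can be verified with a quick back-of-the-envelope calculation. This sketch uses the ungrouped, single-GPU version of the convolutions (the original paper split some layers across two GPUs), which is why the total comes out slightly above 60 million:

```python
def conv_params(k, c_in, c_out):
    # each filter has k x k x c_in weights, plus one bias per filter
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    # dense weight matrix plus one bias per output unit
    return n_in * n_out + n_out

total = (
    conv_params(11, 3, 96)          # CONV1
    + conv_params(5, 96, 256)       # CONV2
    + conv_params(3, 256, 384)      # CONV3
    + conv_params(3, 384, 384)      # CONV4
    + conv_params(3, 384, 256)      # CONV5
    + fc_params(6 * 6 * 256, 4096)  # FC6 (on the flattened 6x6x256 tensor)
    + fc_params(4096, 4096)         # FC7
    + fc_params(4096, 1000)         # FC8, feeding the softmax
)
print(total)  # 62378344 -- about 60 million
```

The fully connected layers dominate: FC6 alone accounts for roughly 38 million of the parameters.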

A **Keras** implementation of this network is shown below.

```python
from tensorflow.keras import backend as K
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Dropout, Flatten, MaxPooling2D)
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2


class AlexNet:
    @staticmethod
    def build(width, height, depth, classes, reg=0.0002):
        # initialize the model along with the input shape to be
        # "channels last" and the channels dimension itself
        model = Sequential()
        inputShape = (height, width, depth)
        chanDim = -1
        # if we are using "channels first", update the input shape
        # and channels dimension
        if K.image_data_format() == "channels_first":
            inputShape = (depth, height, width)
            chanDim = 1
        # Block #1: first CONV => RELU => POOL layer set
        model.add(Conv2D(96, (11, 11), strides=(4, 4),
                         input_shape=inputShape, padding="same",
                         kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Dropout(0.25))
        # Block #2: second CONV => RELU => POOL layer set
        model.add(Conv2D(256, (5, 5), padding="same",
                         kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Dropout(0.25))
        # Block #3: CONV => RELU => CONV => RELU => CONV => RELU
        model.add(Conv2D(384, (3, 3), padding="same",
                         kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(Conv2D(384, (3, 3), padding="same",
                         kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(Conv2D(256, (3, 3), padding="same",
                         kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization(axis=chanDim))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Dropout(0.25))
        # Block #4: first set of FC => RELU layers
        model.add(Flatten())
        model.add(Dense(4096, kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))
        # Block #5: second set of FC => RELU layers
        model.add(Dense(4096, kernel_regularizer=l2(reg)))
        model.add(Activation("relu"))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))
        # softmax classifier
        model.add(Dense(classes, kernel_regularizer=l2(reg)))
        model.add(Activation("softmax"))
        # return the constructed network architecture
        return model
```
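As a sanity check, the spatial sizes in the architecture table can be reproduced with the standard output-size formulas: a "same"-padded convolution outputs ceil(input / stride), and a valid (unpadded) pooling layer outputs floor((input - kernel) / stride) + 1:

```python
import math

def conv_same(size, stride):
    # "same" padding: output spatial size = ceil(input / stride)
    return math.ceil(size / stride)

def pool_valid(size, kernel=3, stride=2):
    # max pooling without padding
    return (size - kernel) // stride + 1

s = conv_same(227, 4)   # CONV1, 11x11 filter with stride 4
print(s)                # 57
s = pool_valid(s)       # POOL1
print(s)                # 28
s = conv_same(s, 1)     # CONV2 keeps the spatial size
s = pool_valid(s)       # POOL2
print(s)                # 13
s = conv_same(s, 1)     # CONV3..CONV5 keep the spatial size
s = pool_valid(s)       # POOL3
print(s)                # 6
```

The final 6x6x256 tensor is what gets flattened into the 9216-wide input of the first fully connected layer.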

The TensorBoard graph of AlexNet is shown below.