AlexNet


The next in the series of networks that popularized Convolutional Neural Networks and Deep Learning is the seminal AlexNet, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. To quote from the paper's abstract:

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

Some of the key learnings from this implementation are:

  1. ReLU Activation
    Up until AlexNet, the commonly used activation was tanh. With ReLU, convergence was found to be much faster, and ReLU has since become standard practice (both functions are written out after this list).
  2. Local Response Normalization
    This is a technique used to highlight neurons with relatively large activations. Since ReLU activations are unbounded, normalizing them also helps convergence. When applied to a uniformly activated neighbourhood, all activations are inhibited equally; when applied in the neighbourhood of a strongly excited neuron, that neuron's relative response is amplified, which improves contrast. The normalization formula from the paper is given after this list.
  3. Overlapping Pooling
    A pooling stride smaller than the kernel size is called overlapping pooling, while a stride equal to the kernel size is called non-overlapping pooling. These pooling layers reduce the width and height of the tensor while maintaining the depth. Overlapping pooling was shown to reduce the error in this architecture (the output-size arithmetic is worked out after this list).
  4. Data Augmentation
    Even with a huge dataset like ImageNet, the large number of parameters in the model makes it prone to overfitting, so data augmentation was needed to artificially increase the size of the dataset. The augmentation runs on the CPU in batches while the GPU is training on the previous batch, so it adds essentially no computational overhead.
    Several 227 x 227 patches were extracted from the input 256 x 256 image along with their horizontal reflections. Patch extraction increases the dataset by roughly $(256 - 227)^2 = 841$, and horizontal flips double that to $841 \times 2$, which helps reduce overfitting (a minimal crop-and-flip sketch is given after this list). The image intensities were also altered, a concept that was esoteric to me; perhaps a later write-up on that.
  5. Dropout
    Another new concept introduced in this architecture. Each neuron has a 0.5 probability of not contributing to the forward and backward passes. This prevents the network from relying on any single dominant neuron, so the network effectively acts as an average of several different networks, which reduces overfitting (see the masking sketch after this list).

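For reference, the two activations compared in the paper are

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},$$

and the non-saturating positive region of ReLU is what the paper credits for the faster convergence.
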
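The response-normalized activity from the paper is

$$ b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta}} $$

where $a^{i}_{x,y}$ is the activity of kernel $i$ at position $(x, y)$, $N$ is the number of kernels in the layer, and the paper sets $k = 2$, $n = 5$, $\alpha = 10^{-4}$ and $\beta = 0.75$.
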
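As a quick sanity check of the sizes in the table below, a pooling layer with kernel size $F$, stride $S$ and no padding produces

$$ W_{\text{out}} = \left\lfloor \frac{W_{\text{in}} - F}{S} \right\rfloor + 1, $$

so the first (overlapping) 3 x 3, stride-2 pool maps the 57 x 57 feature map to $\lfloor (57 - 3)/2 \rfloor + 1 = 28$.
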
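A minimal sketch of the crop-and-flip augmentation described above (the helper name and the NumPy-only implementation are my own, not from the paper):

import numpy as np

def random_crop_and_flip(image, crop=227):
	# image: an H x W x C array, e.g. a 256 x 256 x 3 ImageNet image
	h, w = image.shape[:2]
	top = np.random.randint(0, h - crop + 1)
	left = np.random.randint(0, w - crop + 1)
	patch = image[top:top + crop, left:left + crop]
	# take the horizontal reflection with probability 0.5
	if np.random.rand() < 0.5:
		patch = patch[:, ::-1]
	return patch
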
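A minimal sketch of dropout masking a layer's activations (shown with the inverted-dropout scaling that modern frameworks such as Keras use, rather than the paper's test-time scaling by 0.5):

import numpy as np

def dropout_forward(activations, p_drop=0.5):
	# zero each unit with probability p_drop; scale the survivors by
	# 1 / (1 - p_drop) so the expected activation stays the same
	mask = (np.random.rand(*activations.shape) >= p_drop) / (1.0 - p_drop)
	return activations * mask
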
The architecture of the network is summarized in the table below.

| Layer Type  | Output Size   | Filter Size / Stride    |
|-------------|---------------|-------------------------|
| INPUT IMAGE | 227 x 227 x 3 |                         |
| CONV        | 57 x 57 x 96  | 11 x 11 / 4 x 4, K = 96 |
| ACT         | 57 x 57 x 96  |                         |
| BN          | 57 x 57 x 96  |                         |
| POOL        | 28 x 28 x 96  | 3 x 3 / 2 x 2           |
| DROPOUT     | 28 x 28 x 96  |                         |
| CONV        | 28 x 28 x 256 | 5 x 5, K = 256          |
| ACT         | 28 x 28 x 256 |                         |
| BN          | 28 x 28 x 256 |                         |
| POOL        | 13 x 13 x 256 | 3 x 3 / 2 x 2           |
| DROPOUT     | 13 x 13 x 256 |                         |
| CONV        | 13 x 13 x 384 | 3 x 3, K = 384          |
| ACT         | 13 x 13 x 384 |                         |
| BN          | 13 x 13 x 384 |                         |
| CONV        | 13 x 13 x 384 | 3 x 3, K = 384          |
| ACT         | 13 x 13 x 384 |                         |
| BN          | 13 x 13 x 384 |                         |
| CONV        | 13 x 13 x 256 | 3 x 3, K = 256          |
| ACT         | 13 x 13 x 256 |                         |
| BN          | 13 x 13 x 256 |                         |
| POOL        | 6 x 6 x 256   | 3 x 3 / 2 x 2           |
| DROPOUT     | 6 x 6 x 256   |                         |
| FC          | 4096          |                         |
| ACT         | 4096          |                         |
| BN          | 4096          |                         |
| DROPOUT     | 4096          |                         |
| FC          | 4096          |                         |
| ACT         | 4096          |                         |
| BN          | 4096          |                         |
| DROPOUT     | 4096          |                         |
| FC          | 1000          |                         |
| SOFTMAX     | 1000          |                         |

The network consists of 5 convolutional layers and 3 fully connected layers leading to about 60 million parameters to learn.
A Keras implementation of this architecture is shown below. Note that, as in the table above, it uses Batch Normalization in place of the original Local Response Normalization, and it adds a light dropout (0.25) after each pooling block in addition to the 0.5 dropout in the fully connected layers.

# imports (assuming TensorFlow's bundled Keras, i.e. tf.keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, BatchNormalization
from tensorflow.keras.layers import MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.regularizers import l2
from tensorflow.keras import backend as K

class AlexNet:
	@staticmethod
	def build(width, height, depth, classes, reg=0.0002):
		# initialize the model along with the input shape to be
		# "channels last" and the channels dimension itself
		model = Sequential()
		inputShape = (height, width, depth)
		chanDim = -1

		# if we are using "channels first", update the input shape
		# and channels dimension
		if K.image_data_format() == "channels_first":
			inputShape = (depth, height, width)
			chanDim = 1

		# Block #1: first CONV => RELU => POOL layer set
		model.add(Conv2D(96, (11, 11), strides=(4, 4),
			input_shape=inputShape, padding="same",
			kernel_regularizer=l2(reg)))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
		model.add(Dropout(0.25))

		# Block #2: second CONV => RELU => POOL layer set
		model.add(Conv2D(256, (5, 5), padding="same",
			kernel_regularizer=l2(reg)))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
		model.add(Dropout(0.25))

		# Block #3: CONV => RELU => CONV => RELU => CONV => RELU
		model.add(Conv2D(384, (3, 3), padding="same",
			kernel_regularizer=l2(reg)))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(Conv2D(384, (3, 3), padding="same",
			kernel_regularizer=l2(reg)))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(Conv2D(256, (3, 3), padding="same",
			kernel_regularizer=l2(reg)))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
		model.add(Dropout(0.25))

		# Block #4: first set of FC => RELU layers
		model.add(Flatten())
		model.add(Dense(4096, kernel_regularizer=l2(reg)))
		model.add(Activation("relu"))
		model.add(BatchNormalization())
		model.add(Dropout(0.5))

		# Block #5: second set of FC => RELU layers
		model.add(Dense(4096, kernel_regularizer=l2(reg)))
		model.add(Activation("relu"))
		model.add(BatchNormalization())
		model.add(Dropout(0.5))

		# softmax classifier
		model.add(Dense(classes, kernel_regularizer=l2(reg)))
		model.add(Activation("softmax"))

		# return the constructed network architecture
		return model
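
A minimal usage sketch of the class above (the optimizer and its settings here are illustrative choices for TF 2.x, not the paper's exact training setup):

# build, compile and inspect the model
from tensorflow.keras.optimizers import SGD

model = AlexNet.build(width=227, height=227, depth=3, classes=1000)
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
	loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()  # prints the layer shapes and the parameter count (~60M)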

The TensorBoard graph of AlexNet is shown below.
