Tutorial: Image Classification with Lasagne – Part 1

Introduction

Image classification is one of the major applications of artificial neural networks. It is also a good starting point for an introduction to computer vision with what is often referred to as “Deep Learning”. Today’s tools and frameworks allow us to quickly implement all kinds of neural networks for data processing, and many state-of-the-art network implementations are published by their authors as open source repositories. Hence, we will dive right into the world of neural networks for visual recognition. All you need is a basic understanding of programming (we will use Python in this tutorial), some knowledge of Artificial Neural Networks, access to a PC with a Linux OS (preferably Ubuntu 14.04 LTS) and a CUDA-capable graphics card (e.g. NVIDIA GTX 980). You will be able to run all examples using a CPU instead of a GPU, but be aware that computation time will increase significantly. If you have difficulties understanding some of the underlying concepts, don’t give up: I will provide you with useful links to complementary guides and articles during this tutorial. You will be able to run the code without further knowledge.

Code

You can find the code, additional files and the dataset from this tutorial in my GitHub repository: https://github.com/kahst

Prerequisites

There are a few things we need to install before we can start with our tutorial. The basic packages we need are the CUDA-toolkit (Installation instructions) and Python 2.7.x with the packages numpy, scikit-learn and matplotlib. After the installation of Python, you should be able to simply install everything else by typing this command:

(Note: You may have to add sudo to the command if you need admin privileges; installation may take a while, depending on your machine. If the pip command is not available, try installing pip with apt-get install python-pip. If you have trouble installing scikit-learn, try installing SciPy systemwide first: https://www.scipy.org/install.html)

Visual recognition requires the efficient handling of image files. I prefer OpenCV for all image manipulation calls, but feel free to use any other library (e.g. PIL). You can install the required CV2 packages for Ubuntu with this command:

In this tutorial we will use the easy-to-use and yet powerful Lasagne library to build and train neural networks in Theano. The installation is pretty straightforward:

If you need additional instructions, please see the Lasagne Installation instructions. Lasagne has excellent documentation, which you can find here: http://lasagne.readthedocs.org/

We have to configure Theano if we want to make use of our GPU. We can do that by creating a .theanorc file in our home directory (~/.theanorc). The file should have the following content:
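A minimal sketch of such a file, written out with a few lines of Python. It assumes the classic Theano GPU backend (device = gpu) and 32-bit floats; your setup may need more options. The helper name write_theanorc is hypothetical, and it deliberately refuses to overwrite an existing configuration.

```python
import os

# Assumed minimal configuration: run on the GPU with 32-bit floats.
THEANORC_CONTENT = """[global]
device = gpu
floatX = float32
"""


def write_theanorc(path=None, content=THEANORC_CONTENT):
    """Write a Theano config file unless one already exists at `path`."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".theanorc")
    if os.path.exists(path):
        return False  # never overwrite an existing configuration
    with open(path, "w") as handle:
        handle.write(content)
    return True
```

Calling write_theanorc() without arguments targets ~/.theanorc; of course you can also create the file by hand in any text editor.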

(Note: If you would like to run the examples on your CPU, change the line to device = cpu. You can find more information on the Theano config here: Theano Tutorial. Lasagne might not use the latest Theano release. If you encounter problems with CUDA and Theano, try updating your Theano version: Install Guide)

You can also specify the CUDA version you would like to use. If you run into difficulties with the CUDA compiler NVCC, try adding the following to your .theanorc file:

(Note: If Lasagne still can’t find your GPU, try updating the NVIDIA graphics drivers. You can check the status of your GPU by typing nvidia-smi -l in the terminal.)

Dataset

We will start this tutorial by exploring the image dataset. You can find a ZIP archive containing the images we will use during this tutorial in my Google Drive. Download the file and extract it into a folder of your choice. I will simply call mine “dataset”. In it, you will find five subfolders, each containing 500 images of one animal class. This is a very small dataset with only five classes: Cat, Dog, Bird, Horse and Cow. All images are part of ImageNet synsets; you can find more here: ImageNet Database

Let’s have a look at the images from our dataset:

[Image: sample images from the dataset]

(Note: We may use images from ImageNet for non-commercial and educational purposes only, so please do not re-distribute these images.)

In order to keep things simple, we will use the subfolders from our dataset as class labels. Whenever you wish to add new categories to the dataset, simply create a new subfolder and put all image files in there.

Code editing is easy for Python: Any text editor will do. I personally prefer the Python IDLE which comes with our Python installation and supports simple text editing and syntax highlighting. Feel free to switch to whichever IDE you like most. We can start IDLE by typing “idle” in the command line (you might need to install idle first by typing: apt-get install idle).

After creating a new file, we will start with dataset parsing. We want to retrieve class labels from subfolders and all image paths so we can reference them later.
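A minimal sketch of this parsing step, using only the standard library. The function name parse_dataset and the list of accepted file extensions are assumptions for illustration. Class names are sorted here to get a reproducible order, which is a choice of this sketch and not a requirement.

```python
import os

# Hypothetical list of file extensions we treat as images.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def parse_dataset(dataset_dir):
    """Return (classes, samples): class names and (path, label) pairs."""
    # Every subfolder of the dataset directory is one class label.
    classes = sorted(
        name for name in os.listdir(dataset_dir)
        if os.path.isdir(os.path.join(dataset_dir, name))
    )
    samples = []
    for label in classes:
        class_dir = os.path.join(dataset_dir, label)
        for filename in sorted(os.listdir(class_dir)):
            if filename.lower().endswith(IMAGE_EXTENSIONS):
                samples.append((os.path.join(class_dir, filename), label))
    return classes, samples
```

With the folder from above, parse_dataset("dataset") should return the five class names and 2500 (path, label) pairs.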

Usually, datasets are split into train, validation and test subsets. Training images are those the neural net “sees” and learns from. Validation and test images are used to monitor the performance of the net during and after the training process and are never used to update its parameters. In our case, a validation split of 15% will suffice.

Shuffling the dataset is very important: it prevents images of one class from clustering in the batches we pass through the net. Additionally, it makes it likely that our validation set contains images from every class in roughly equal proportions.
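Shuffling and splitting can be sketched in a few lines. The function name, the fixed seed and the stand-in list of 2500 samples are assumptions. Note that a plain shuffle only balances the classes approximately; an exactly balanced validation set would need a per-class (stratified) split.

```python
import random


def shuffle_and_split(samples, val_fraction=0.15, seed=1337):
    """Shuffle a copy of `samples` and return (train, validation) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split every run
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in for the 2500 (path, label) pairs of our dataset.
train, val = shuffle_and_split(list(range(2500)))
print("{} train / {} validation".format(len(train), len(val)))
# 2125 train / 375 validation
```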

(Note: Python is an elegant programming language with multiple code-solutions to one problem. None of them is the best and none of them holds ultimate truth. There will always be ways to optimize your code, so feel free to change the provided code as you like.)

Building the Model

Convolutional Neural Networks (CNNs, ConvNets) are the first choice when it comes to image processing and visual recognition. ConvNets expect inputs to be images, which allows us to drastically reduce the number of parameters in our network. Unlike conventional, fully connected neural networks (in which every neuron is connected to all neurons of the preceding layer), CNNs connect neurons in one layer only to a small region (usually an area of 3×3) of the preceding layer. We can see the results of this process here:

[Animations: no padding, arbitrary padding, half (or same) padding, full padding; all without strides]

In these beautifully animated drawings (Source: https://github.com/vdumoulin) we can see how convolutional layers activate the neurons of a layer (green) by connecting them to a subset of the neurons of the preceding layer (blue). The filter size (gray) in all four images is 3×3 or 4×4, which by default results in a varying number of neurons in the succeeding convolutional layer. If we want to avoid that, we can use “Zero-Padding” to artificially increase the dimensions of the preceding layer. Padding has an impact on the shape of your convolutional layers and should always be taken into account.
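The effect of padding on the layer size follows a simple formula: output = (input + 2 × padding − filter) / stride + 1, per spatial dimension. Here is a small sketch; the function name and the example input width of 5 are chosen for illustration only.

```python
def conv_output_size(input_size, filter_size, pad=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * pad - filter_size) // stride + 1


# A 3x3 filter moving over an input of width 5 with stride 1:
no_padding = conv_output_size(5, 3, pad=0)    # 3: the output shrinks
same_padding = conv_output_size(5, 3, pad=1)  # 5: the output keeps its size
full_padding = conv_output_size(5, 3, pad=2)  # 7: the output grows
```

For a 3×3 filter, “same” padding therefore means one row or column of zeros on every side of the input.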

If you want to learn more about the concept of ConvNets, you should definitely read this excellent guide.

Our first ConvNet

Every ConvNet is a sequence of layers. Most architectures feature three types of layers: Convolutional Layers, Pooling Layers and Dense Layers. Convolutional Layers compute the output of neurons connected to regions in the input, Pooling Layers perform downsampling on the input dimensions (width and height) and Dense Layers compute our class scores.

The architecture in our example is:

[INPUT – CONV – POOL – CONV – POOL – CONV – POOL – CONV – POOL – DENSE – DENSE – DENSE]

Building a ConvNet in Lasagne is easy. Let’s have a look at the code:
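The following is a sketch of such a model, not necessarily the exact listing from the repository. It follows the architecture string above; the first convolution uses 32 filters as described in this tutorial, while the remaining filter counts (64, 128, 256), the use of max pooling and the dense layer sizes (128, 128) are assumptions. The imports sit inside the function so the file can still be loaded on a machine without Theano and Lasagne.

```python
def build_model(num_classes=5):
    """INPUT, 4 x (CONV, POOL), DENSE, DENSE, DENSE with a softmax output."""
    # Imported here so that this file can be loaded without Theano/Lasagne.
    from lasagne import layers
    from lasagne.nonlinearities import softmax, tanh

    # Batches of RGB images with 64x64 pixels; None = variable batch size.
    net = layers.InputLayer(shape=(None, 3, 64, 64))

    # Filter counts after the first one (32) are assumptions of this sketch.
    for num_filters in (32, 64, 128, 256):
        net = layers.Conv2DLayer(net, num_filters=num_filters, filter_size=3,
                                 pad='same', nonlinearity=tanh)
        net = layers.MaxPool2DLayer(net, pool_size=2)

    # Dense layer sizes are assumptions; the last layer yields class scores.
    net = layers.DenseLayer(net, num_units=128, nonlinearity=tanh)
    net = layers.DenseLayer(net, num_units=128, nonlinearity=tanh)
    net = layers.DenseLayer(net, num_units=num_classes, nonlinearity=softmax)
    return net
```

Calling build_model() returns the output layer; in Lasagne this last layer is all we need to reference the whole network.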

First, we have to import Lasagne (or at least parts of it; in our case Layers and the Nonlinearities “softmax” and “tanh”). Run the code (simply press F5 in IDLE). This may take a few seconds. If everything goes well, you should see a message like this:

This means that Lasagne and Theano are able to use our CUDA-capable graphics card (the device specified in our ~/.theanorc) and the cuDNN library for CUDA.

Convolutional Layers expect the input to be a 4D tensor. Therefore, our Input Layer has the shape (None, 3, 64, 64). This means the size of our input image (RGB = 3 channels) is 64×64 pixels. This is significantly smaller than the source images from our dataset. Larger inputs consume considerably more memory, so we want to keep them just large enough to make their contents recognizable. Either way, all inputs have to share the same width and height in order to fit into our network, so we scale every image to a square of that size. The first argument “None” is a placeholder and allows us to change the number of images we pass through the net at once later on.

We use a filter size of 3×3 and “same” padding in our example, which means the convolution leaves the spatial size of its input unchanged. We can specify the number of filters we want to apply; usually the number of filters increases with every following convolution. So, after the first convolution the number of channels of our input has increased from 3 (RGB) to 32 (number of filters).

Pooling Layers with a pool size of “2” reduce the spatial dimensions of their input by half. This means that the resulting spatial size after our first Pooling Layer is 32×32 pixels. But remember: the output shapes will only look like this if you use “same” padding:
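We can trace the shapes by hand with a few lines of plain Python. This is a sketch for a stack of CONV and POOL layers with “same” padding and a pool size of 2; the filter counts after the first 32 are assumed values.

```python
def trace_shapes(input_shape=(3, 64, 64), filters=(32, 64, 128, 256), pool_size=2):
    """List (layer name, output shape) for a CONV, POOL stack with 'same' padding."""
    channels, height, width = input_shape
    shapes = [("INPUT", (channels, height, width))]
    for index, num_filters in enumerate(filters, start=1):
        channels = num_filters  # 'same' padding keeps height and width
        shapes.append(("CONV{}".format(index), (channels, height, width)))
        height, width = height // pool_size, width // pool_size
        shapes.append(("POOL{}".format(index), (channels, height, width)))
    return shapes


for name, shape in trace_shapes():
    print("{:<6} {}".format(name, shape))
```

The first pooling layer outputs (32, 32, 32) and, with these assumed filter counts, the last one outputs (256, 4, 4).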

Every neuron in our convolutional layers has an activation function, which determines the resulting output. In Lasagne, we can choose from various activation functions; we will use the hyperbolic tangent (tanh) for this tutorial.

Class probabilities are the result of the neuron activations of the last layer. The highest of these five values marks the class most likely depicted in our input image. Using the so-called “Softmax” activation function (as nonlinearity) results in neuron outputs between 0 and 1. This function also normalizes the outputs so that their sum is always 1. Therefore, an output vector with the values [0.1, 0.25, 0.5, 0.1, 0.05] indicates that the third class (index 2) is the most probable for a given input image.
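To make this concrete, here is a small pure-Python sketch of the softmax function. The raw scores in the example are made up; Lasagne applies the same idea to the activations of the last layer.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum does not change the result, but avoids overflow.
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


probabilities = softmax([0.2, 1.1, 1.8, 0.2, -0.5])  # hypothetical scores
predicted_index = probabilities.index(max(probabilities))  # 2, the third class
```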

Lasagne provides us with a number of helper functions; one of them counts all parameters of our ConvNet. Even an architecture as simple as ours has more than 900,000 parameters (weights) we have to optimize during training.
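In Lasagne, the helper in question is lasagne.layers.count_params. To see where such a number comes from, we can also count by hand: every convolution has filters × input channels × 3 × 3 weights plus one bias per filter, and every dense layer has inputs × units weights plus one bias per unit. The layer sizes below (apart from the 32 filters of the first convolution) are assumptions, so the total only illustrates the order of magnitude.

```python
def count_parameters(input_shape=(3, 64, 64), filters=(32, 64, 128, 256),
                     dense_units=(128, 128, 5), filter_size=3, pool_size=2):
    """Count weights and biases of a CONV, POOL stack followed by dense layers."""
    channels, height, width = input_shape
    total = 0
    for num_filters in filters:
        # One filter_size x filter_size kernel per input channel, plus a bias.
        total += num_filters * channels * filter_size * filter_size + num_filters
        channels = num_filters
        # 'same' padding keeps the size; pooling divides it by pool_size.
        height, width = height // pool_size, width // pool_size
    inputs = channels * height * width
    for units in dense_units:
        total += inputs * units + units
        inputs = units
    return total


print(count_parameters())  # 929989
```

Most of these parameters belong to the first dense layer, which connects all 256 × 4 × 4 outputs of the last pooling layer to every one of its units.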

Loss and Updates

In order to train our ConvNet, we need to specify what our objective is. Obviously, it is getting the class label right on a given input image. But how can we quantify that? We need a measure for how well the net actually predicts. This is where the loss function is needed. It calculates the error of the prediction. Minimizing this loss is the objective of our training process.

We use the categorical crossentropy as loss measure. To do that, we need two things: the current prediction of our ConvNet on a batch of images and the corresponding targets (class labels as vectors; we will prepare those later during the batch generation) stored in a Theano variable.

Same for the accuracy function. We can use the Lasagne objective “categorical_accuracy” to calculate the top-1 accuracy.

Again, all we need are predictions and targets. We will assign those later during the training process using our validation split (this is important: we must not use the train split to evaluate the performance of the net; we might get fooled otherwise).
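What these two measures compute can be shown in plain Python. This is a sketch on two made-up predictions; in the actual training code, Lasagne builds the same calculations as symbolic Theano expressions.

```python
import math


def categorical_crossentropy(prediction, target):
    """Loss for one sample: -sum(target * log(prediction))."""
    return -sum(t * math.log(p) for p, t in zip(prediction, target) if t > 0)


def top1_accuracy(predictions, targets):
    """Fraction of samples whose most probable class equals the target class."""
    hits = 0
    for prediction, target in zip(predictions, targets):
        if prediction.index(max(prediction)) == target.index(max(target)):
            hits += 1
    return hits / float(len(predictions))


# Two hypothetical predictions: the first is right, the second is wrong.
predictions = [[0.1, 0.25, 0.5, 0.1, 0.05], [0.6, 0.1, 0.1, 0.1, 0.1]]
targets = [[0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0]]

losses = [categorical_crossentropy(p, t) for p, t in zip(predictions, targets)]
mean_loss = sum(losses) / len(losses)
accuracy = top1_accuracy(predictions, targets)  # 0.5
```

The confident but wrong second prediction produces a much larger loss than the first one, which is exactly the behaviour we want from a loss function.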

(Note: The Top-k accuracy means that the correct class label is among the k highest class probabilities of the net output. Usually the accuracy of a ConvNet is measured with Top-1 and Top-3 accuracy. Values of k above 1 only make sense if we have significantly more class labels to predict.)

The update function of our net will update the network parameters after every batch of images to find the optimal configuration. We can choose between different methods of optimization; they are all based on gradient descent and aim at minimizing the loss. However, there are notable differences in convergence speed and accuracy, which mostly depend on the given optimization problem. In our case, the Adam updates are a good choice to optimize the parameters of our ConvNet.

We need to specify which parameters of our ConvNet should be altered (in our case all network weights; there might be use cases where only a fraction of those parameters should be optimized). Therefore, we use the Lasagne helper function get_all_params and pass its result to our optimizer in combination with the calculated loss value.

The learning rate scales the size of the parameter changes in every update step. Choosing the right learning rate is a bit tricky and depends on your network and the problem you want to solve. A small learning rate slows down the learning process, but usually improves the result of parameter optimization. On the other hand, a higher learning rate might speed things up a little, but the optimizer might never find a good parameter configuration (learning rates that are too high often cause “exploding” losses). Usually, learning rates range from 0.01 to 0.00001.
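A toy example shows the effect. Here we use plain gradient descent (not Adam) to minimize f(x) = x², a problem far simpler than a ConvNet, so the concrete rates below are only illustrative.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(x) = x**2 (gradient: 2*x) and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # step against the gradient
    return x


for rate in (0.001, 0.1, 1.1):
    print("learning rate {}: x = {:.4f}".format(rate, gradient_descent(rate)))
```

With the smallest rate, x has barely moved away from its start value after 20 steps; with 0.1 it ends up close to the minimum at 0; with 1.1 every step overshoots and x moves further and further away, which is the “exploding” behaviour mentioned above.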

You can find additional information on neural network optimization here: http://cs231n.github.io/optimization-1/

Theano Functions

We defined our objective and optimizer; now we need a way to pass some images through the net (forward pass), calculate the loss and update parameters (backpropagation, backward pass). We can do this by defining a Theano function which takes our images and corresponding targets as inputs and executes the entire process of ConvNet training. We will call this function many times, until we are satisfied with the net output. The compilation of Theano functions takes a while when run for the first time, so now is a good time to get some coffee…

We decided to use 15% of our dataset as validation split, so those images should not be part of the ConvNet training. If we want to measure how well our ConvNet performs after a certain number of training loops, we use another Theano function for testing. This prediction function has no effect on the trainable parameters of our net; it simply passes some images through the net and generates an output vector. We use this output to calculate the validation loss and accuracy.
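Put together, loss, updates and both Theano functions might look like the following sketch. The function name compile_functions, the learning rate of 0.001 and the decision to let the test function return loss and accuracy directly are assumptions; the Lasagne and Theano calls themselves are the standard ones. The imports sit inside the function so the file can be loaded without Theano and Lasagne.

```python
def compile_functions(net, learning_rate=0.001):
    """Return (train_fn, test_fn) for the output layer `net` of a Lasagne model."""
    # Imported here so that this file can be loaded without Theano/Lasagne.
    import theano
    import theano.tensor as T
    from lasagne import layers, objectives, updates

    # Symbolic inputs: the image batch and the one-hot targets.
    inputs = layers.get_all_layers(net)[0].input_var
    targets = T.matrix('targets')

    # Training: loss on the current batch plus Adam parameter updates.
    prediction = layers.get_output(net)
    loss = objectives.categorical_crossentropy(prediction, targets).mean()
    params = layers.get_all_params(net, trainable=True)
    param_updates = updates.adam(loss, params, learning_rate=learning_rate)
    train_fn = theano.function([inputs, targets], loss, updates=param_updates)

    # Validation: same measures, but no updates are applied.
    test_prediction = layers.get_output(net, deterministic=True)
    test_loss = objectives.categorical_crossentropy(test_prediction, targets).mean()
    accuracy = objectives.categorical_accuracy(test_prediction, targets).mean()
    test_fn = theano.function([inputs, targets], [test_loss, accuracy])
    return train_fn, test_fn
```

train_fn takes a batch of images and targets, updates the weights and returns the training loss; test_fn takes the same kind of input, leaves the weights untouched and returns validation loss and top-1 accuracy.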

Training

When we train our ConvNet, we don’t pass one single image at a time through the net, nor do we pass our complete dataset at once. We do this in chunks or batches of images. The size of these batches is a crucial value, too. We shuffled our image paths, so the distribution of classes should be roughly equal in every batch, but larger batches present the net with a more heterogeneous glimpse of our dataset. Remember, ConvNet parameters are updated after every batch. If the batch size we choose is too small, we might update our net based on patterns present only in this particular batch. If our batch size is too large, identifying patterns might get too difficult and training may slow down. A batch size of 128 images is reasonable and we should start with that.

(Note: Changing the batch size also has implications for our learning rate. We should lower the learning rate with decreasing batch size and vice versa.)

Dataset Batches

Every batch consists of image pixel values and class targets. So far, we collected the paths of all images from our dataset. Now we need to load the actual image data:

As I mentioned earlier, I like OpenCV for image operations. Therefore, we need to import CV2 and NumPy in order to load images from our hard drive. Opening an image and preparing it for a batch is easy this way. We read an image from the given file path, scale it to 64×64 pixels (this will squeeze the image), transpose its axes and reshape its dimensions, so it fits our ConvNet input shape.
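A sketch of such a loading function with OpenCV and NumPy. The function name load_image and the scaling of pixel values to the range 0 to 1 are assumptions; the imports sit inside the function so the file can be loaded without OpenCV installed.

```python
def load_image(path, size=64):
    """Load an image as a float32 array of shape (1, 3, size, size)."""
    # Imported here so that this file can be loaded without OpenCV/NumPy.
    import cv2
    import numpy as np

    image = cv2.imread(path)  # height x width x channels, or None on failure
    if image is None:
        raise IOError("Could not read image: {}".format(path))
    image = cv2.resize(image, (size, size))     # squeezes non-square images
    image = image.astype(np.float32) / 255.0    # assumed scaling to [0, 1]
    image = np.transpose(image, (2, 0, 1))      # to channels x height x width
    return image.reshape(1, 3, size, size)      # add the batch dimension
```

Keep in mind that OpenCV loads color images in BGR channel order. That is fine as long as training and prediction use the same loading function.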

Our classification target has to be in the right shape as well. The output of our ConvNet is an array with 5 values. So, every class label has to be represented by such an array. We want to use subfolders as class labels, so the only thing we have to do is get the index of the image label from our classes and convert it into an array with 5 values. If the label for an image is “dog” and the index of “dog” in our classes is “2”, this results in an array like this: [0.0, 0.0, 1.0, 0.0, 0.0]. If our net classifies an image as dog, its output vector should look just like this target array.
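In code, this conversion takes only a few lines. The sketch below uses a plain list instead of a NumPy array, and the order of the class names is made up for the example.

```python
def one_hot(label, classes):
    """Return a target vector with 1.0 at the index of `label`."""
    target = [0.0] * len(classes)
    target[classes.index(label)] = 1.0
    return target


CLASSES = ["bird", "cat", "dog", "horse", "cow"]  # hypothetical order
print(one_hot("dog", CLASSES))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```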

(Note: Labels in our CLASSES array won’t necessarily be in alphabetical order, so be careful when assigning indices to labels.)

In order to fetch 128 images and their corresponding target arrays, we need to iterate over our train split of all the image paths and return every time we reach 128 samples.
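The core of this iteration can be sketched as a small generator. This simplified version yields plain lists of samples and leaves loading images and building NumPy arrays aside; the function name and the stand-in list of 2125 samples are assumptions.

```python
def iterate_batches(samples, batch_size=128):
    """Yield consecutive chunks of `samples`; the last one may be smaller."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # remaining samples that did not fill a whole batch


# Stand-in for the image paths of our train split.
sizes = [len(batch) for batch in iterate_batches(list(range(2125)))]
print("{} batches, the last one holds {} samples".format(len(sizes), sizes[-1]))
# 17 batches, the last one holds 77 samples
```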

Again, we allocate two NumPy arrays and fill them with data before returning them. There is nothing really special about the two functions above, except for the “yield” statement. It turns the function into a generator, which hands us the next batch every time we ask for one. This is really useful and saves us the trouble of saving the current state of iteration. All we have to do is call the yielding function inside a loop. We can specify whether to use our train or validation split (they’re both simple lists with image paths in them), which we will do later.

Iterations and Epochs

Our training split contains 2125 images. At a batch size of 128, we have to pass a total of 17 batches through the net in order to present every image one time. Each time we perform a forward and backward pass (= 1 batch) we call this an “iteration”. Once we have passed the whole dataset through the net (= after 17 batches) we call this an “epoch”. This means that the ConvNet parameters from our example are updated 17 times during one epoch.

In our example, training lasts 30 epochs. The train and test functions we defined earlier are called every epoch. We get batches from our train and validation split of the dataset and pass them through the net. We also keep track of training loss, validation loss and validation accuracy. This is important for future optimizations.
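The structure of such a training loop can be sketched independently of Theano. In this sketch, train_fn is assumed to return the loss of one batch, test_fn is assumed to return a (loss, accuracy) pair, and train_batches and val_batches are assumed to be functions that return a fresh iterator over (images, targets) batches. All of these names are hypothetical.

```python
def train(train_fn, test_fn, train_batches, val_batches, epochs=30):
    """Run the training loop and return per-epoch statistics."""
    history = {"train_loss": [], "val_loss": [], "val_acc": []}
    for epoch in range(epochs):
        # One iteration per batch: forward pass, backward pass, update.
        train_losses = [float(train_fn(x, y)) for x, y in train_batches()]
        # The validation batches never change the network parameters.
        val_stats = [tuple(float(v) for v in test_fn(x, y))
                     for x, y in val_batches()]

        train_loss = sum(train_losses) / len(train_losses)
        val_loss = sum(s[0] for s in val_stats) / len(val_stats)
        val_acc = sum(s[1] for s in val_stats) / len(val_stats)
        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)
        print("epoch {:2d}: train loss {:.4f}, val loss {:.4f}, val acc {:.2%}".format(
            epoch + 1, train_loss, val_loss, val_acc))
    return history
```

Averaging over batches treats a smaller last batch like a full one, which is a small simplification that is good enough for monitoring.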

Statistics

You should be able to run the code, parse the dataset, build a model and train it accordingly. Your output should look similar to this:

We can visualize the losses and validation accuracy with pyplot:
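A minimal plotting sketch, assuming the per-epoch values were collected in a dictionary with the hypothetical keys train_loss, val_loss and val_acc. The chart layout is an assumption as well. The import sits inside the function so the file can be loaded without matplotlib.

```python
def plot_history(history, filename=None):
    """Plot losses and validation accuracy per epoch."""
    # Imported here so that this file can be loaded without matplotlib.
    import matplotlib.pyplot as plt

    epochs = range(1, len(history["train_loss"]) + 1)
    plt.plot(epochs, history["train_loss"], label="train loss")
    plt.plot(epochs, history["val_loss"], label="validation loss")
    plt.plot(epochs, history["val_acc"], label="validation accuracy")
    plt.xlabel("epoch")
    plt.legend()
    if filename:
        plt.savefig(filename)  # write the chart to an image file
    else:
        plt.show()             # open an interactive window
```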

Your chart should look like this:

Depending on your hardware, training one epoch should take a few seconds. If you are using the CPU mode of Theano, you should experience a significantly slower training process. Validation accuracy will start at ~30% (which is hardly better than randomly guessing one of five classes) and reach its maximum at ~44%. That’s far from great, but it’s a start. We will discuss a variety of improvements in part 2 of this tutorial.

Thanks for reading!

Next: Image Classification with Lasagne – Part 2 (Coming soon…)
