In part two of this tutorial we will explore some common techniques to improve ConvNet classification results. We will use the code base from the previous tutorial and add some lines here and there, focusing on metric visualization, reproducibility, parameter reduction and learning rate adjustment.

**Random Seeds**

Before we start, we need to do two very important things. In order to keep our results reproducible, we have to fix the random seeds used in Theano, Lasagne and NumPy. Random numbers are generated throughout the entire training process, e.g. for dataset shuffling, image augmentation or layer weight initialization. Fixing the NumPy and Lasagne seeds is easily done with a few lines in our code:

```python
import numpy as np
from lasagne import random as lasagne_random

#Fixed random seed
RANDOM_SEED = 1337
RANDOM = np.random.RandomState(RANDOM_SEED)
lasagne_random.set_rng(RANDOM)
```

We now have an object called **RANDOM** which we will use anywhere we need some randomness. By editing the *.theanorc* file, we can additionally make Theano's cuDNN convolutions deterministic. Simply add these lines:

```
[dnn.conv]
algo_bwd_filter=deterministic
algo_bwd_data=deterministic
```

*(Note: It is completely up to you which random seed you choose. Changing the random seed can sometimes negatively impact the classification accuracy due to shifts in the validation split. This effect does not play such a big role on larger, more diverse datasets.)*
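As a quick sanity check: anything we now drive through **RANDOM** becomes reproducible across runs. Here is a minimal sketch, where a small list of numbers stands in for, e.g., our list of training images:

```python
import numpy as np

#same seed as in our script
RANDOM_SEED = 1337
RANDOM = np.random.RandomState(RANDOM_SEED)

#stand-in for our list of training images
samples = list(range(10))

#shuffling through the shared state yields the same order on every run
RANDOM.shuffle(samples)
print(samples)
```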

**Metrics**

Next, we should look at the results from our previous experiments. You might recall the mediocre validation accuracy of only ~44%. Let’s discuss some of the reasons why this may happen. Have a look at the PyPlot chart from the training of 30 epochs with our base implementation.

As we can see, after a few epochs, training loss and validation loss diverge by a great margin. This is a clear sign of overfitting: our net becomes more and more capable of distinguishing the images from our training set, but it does so by memorizing noise in the dataset that has no semantic link to the classes, not by recognizing the objects and their respective features. After training, our net is nearly perfect at predicting the labels of our training set, but it cannot transfer this knowledge to new images (our validation set). Sooner or later, every ConvNet will start overfitting. In our case, the ConvNet is not very complex and our dataset is pretty small (we might need tens of thousands of images, or even millions, to train a mighty ConvNet). That is why it is important to keep track of the net accuracy by using validation images – images the net is not allowed to train on.

Another good way to visualize the performance of our ConvNet for classification problems is the so-called **“Confusion Matrix”**. It shows how many samples of each target class were assigned to each predicted label, which allows us to see exactly where our net needs to improve. We can plot the confusion matrix with PyPlot using the following code:

```python
from sklearn.metrics import confusion_matrix
import itertools

################## CONFUSION MATRIX #####################
cmatrix = []

def clearConfusionMatrix():
    global cmatrix
    #allocate empty matrix of size 5x5 (for our 5 classes)
    cmatrix = np.zeros((len(CLASSES), len(CLASSES)), dtype='int32')

def updateConfusionMatrix(p, t):
    global cmatrix
    #pass labels explicitly so every batch yields a full 5x5 matrix,
    #even if a batch does not contain samples of every class
    cmatrix += confusion_matrix(np.argmax(t, axis=1), np.argmax(p, axis=1),
                                labels=np.arange(len(CLASSES)))

def showConfusionMatrix():
    #new figure
    plt.figure(1)
    plt.clf()

    #show matrix
    plt.imshow(cmatrix, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title('Confusion Matrix')
    plt.colorbar()

    #tick marks
    tick_marks = np.arange(len(CLASSES))
    plt.xticks(tick_marks, CLASSES)
    plt.yticks(tick_marks, CLASSES)

    #cell labels
    thresh = cmatrix.max() / 2.
    for i, j in itertools.product(range(cmatrix.shape[0]), range(cmatrix.shape[1])):
        plt.text(j, i, cmatrix[i, j],
                 horizontalalignment="center",
                 color="white" if cmatrix[i, j] > thresh else "black")

    #axes labels
    plt.ylabel('Target label')
    plt.xlabel('Predicted label')

    #show
    plt.show()
    plt.pause(0.5)
```

*( Note: I changed some lines of code here and there in the script, which are not mentioned here. If you find yourself confronted with some errors, please have a look at the complete script from the GitHub repository.)*

We predict on our validation set in batches. That means we have to accumulate the single per-batch matrices into one final confusion matrix per epoch. To do so, we need to call the method *clearConfusionMatrix()* at the beginning of each epoch, call *updateConfusionMatrix()* after every validation batch and run *showConfusionMatrix()* at the end of each epoch to show the classification results.
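The accumulation logic itself can be verified in isolation. The following self-contained sketch uses two made-up one-hot batches (the class names are purely illustrative); note that passing `labels=` to scikit-learn's `confusion_matrix` guarantees a full 5×5 matrix even when a batch does not contain samples of every class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

CLASSES = ['cat', 'dog', 'bird', 'cow', 'horse']  #illustrative class order

#epoch-level matrix, as allocated in clearConfusionMatrix()
cmatrix = np.zeros((len(CLASSES), len(CLASSES)), dtype='int32')

#two made-up validation batches as (predictions p, targets t), one-hot encoded
batches = [
    (np.eye(5)[[0, 1, 1]], np.eye(5)[[0, 1, 2]]),  #one 'bird' predicted as 'dog'
    (np.eye(5)[[3, 4, 0]], np.eye(5)[[3, 4, 0]]),  #all three correct
]

for p, t in batches:
    #labels= keeps every per-batch matrix at the full 5x5 size
    cmatrix += confusion_matrix(np.argmax(t, axis=1), np.argmax(p, axis=1),
                                labels=np.arange(len(CLASSES)))

print(cmatrix.trace(), "of", cmatrix.sum(), "samples correct")  # 5 of 6
```

Rows hold target labels and columns predicted labels, so the single misclassification ends up in the off-diagonal cell at row 'bird', column 'dog'.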

Let’s have a look at the confusion matrix from our baseline experiment:

Ideally, all of the predicted labels would be correct (target label = predicted label), leaving all counts on the diagonal of the confusion matrix. This is clearly not the case. Our most accurate class seems to be “cow”: the ConvNet correctly labeled 44 of 75 instances, or 58.6%. The least accurate one is “dog”, with only 19 of 78 (24.4%) correct labels assigned. 23 cats were labeled as dogs and 21 dogs were labeled as cats, which is somewhat understandable, as cats and dogs might look alike in some of the 64×64 pixel images. Nevertheless, we should try to increase the ConvNet performance to avoid this confusion.

So, what can we do? We will try some of the most common techniques to raise the validation accuracy. We will keep track of the current results and compare the influence of each technique with our baseline experiment.

So far our stats look like this:

| Run | Max. Accuracy | Best Epoch |
| --- | --- | --- |
| Randomly guessing 1 of 5 classes | ~20% | – |
| Baseline | 44.0% | 28 |

The accuracy of your experiments should always exceed the *default* or *random guessing accuracy*. Otherwise, your method is worse than guessing. Sometimes it's easy to get fooled even by a validation dataset. Let's assume we have two classes in our dataset: one contains dogs and consists of 100 images; the second shows cats and includes 900 images. By simply assigning the class label *“cat”* to every image, we reach an accuracy of 90% without learning anything. We need to be careful with the interpretation of our evaluation metrics.
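To see that pitfall in numbers, here is a minimal sketch of the hypothetical two-class dataset from above:

```python
import numpy as np

#made-up imbalanced dataset: 900 cats (class 0) and 100 dogs (class 1)
labels = np.array([0] * 900 + [1] * 100)

#a "classifier" that ignores its input and always predicts the majority class
predictions = np.zeros_like(labels)

accuracy = np.mean(predictions == labels)
print("majority-class baseline: %.0f%%" % (accuracy * 100))  # 90%
```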

**Model Optimization**

Of all the things we could try to improve our results, we first should focus on the optimization of our baseline model architecture. When we see heavy overfitting in our loss chart, we can assume that our model has too many parameters. Wait…too many? Yes, indeed. Our classification task is not very complex with only five classes and we have to adjust the model design (we have to do this for every new classification task, every new dataset).

Despite the heterogeneous dataset, we do not need as many weights as we currently have. Let's cut down the parameter count of our model by halving the number of filters in our convolutional layers and the number of hidden units in our dense layers. This reduces the net parameter count **by 75% to ~230,000**. The *buildModel()*-function now looks like this:

```python
################## BUILDING THE MODEL ###################
def buildModel():

    #this is our input layer with the inputs (None, dimensions, width, height)
    l_input = layers.InputLayer((None, 3, 64, 64))

    #first convolutional layer, has l_input layer as incoming and is followed by a pooling layer
    l_conv1 = layers.Conv2DLayer(l_input, num_filters=16, filter_size=3, pad='same', nonlinearity=tanh)
    l_pool1 = layers.MaxPool2DLayer(l_conv1, pool_size=2)

    #second convolution (l_pool1 is incoming), let's increase the number of filters
    l_conv2 = layers.Conv2DLayer(l_pool1, num_filters=32, filter_size=3, pad='same', nonlinearity=tanh)
    l_pool2 = layers.MaxPool2DLayer(l_conv2, pool_size=2)

    #third convolution (l_pool2 is incoming), even more filters
    l_conv3 = layers.Conv2DLayer(l_pool2, num_filters=64, filter_size=3, pad='same', nonlinearity=tanh)
    l_pool3 = layers.MaxPool2DLayer(l_conv3, pool_size=2)

    #fourth and final convolution
    l_conv4 = layers.Conv2DLayer(l_pool3, num_filters=128, filter_size=3, pad='same', nonlinearity=tanh)
    l_pool4 = layers.MaxPool2DLayer(l_conv4, pool_size=2)

    #our cnn contains 3 dense layers, one of them is our output layer
    l_dense1 = layers.DenseLayer(l_pool4, num_units=64, nonlinearity=tanh)
    l_dense2 = layers.DenseLayer(l_dense1, num_units=64, nonlinearity=tanh)

    #the output layer has 5 units which is exactly the count of our class labels
    #it has a softmax activation function, its values represent class probabilities
    l_output = layers.DenseLayer(l_dense2, num_units=5, nonlinearity=softmax)

    #let's see how many params our net has
    print "MODEL HAS", layers.count_params(l_output), "PARAMS"

    #we return the layer stack as our network by returning the last layer
    return l_output

NET = buildModel()
```
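Why roughly 75%? Most weights sit in the connections between layers, and halving both the input and the output side of a connection quarters its weight count. The following back-of-the-envelope sketch ignores biases; the pre-halving filter and unit counts (32/64/128/256 filters, 128 hidden units) are inferred from the halved model above:

```python
def conv_dense_weights(filters, hidden):
    #conv weights: out_channels * in_channels * 3 * 3, chained over 4 conv layers
    chans = [3] + filters
    conv = sum(chans[i] * chans[i + 1] * 3 * 3 for i in range(4))
    #after four 2x2 poolings, a 64x64 image is 4x4 when it reaches the dense layers
    dense_in = filters[-1] * 4 * 4
    dense = dense_in * hidden[0] + hidden[0] * hidden[1] + hidden[1] * 5
    return conv + dense

full = conv_dense_weights([32, 64, 128, 256], [128, 128])  #assumed baseline
half = conv_dense_weights([16, 32, 64, 128], [64, 64])     #the model above
print(half, "weights, about a quarter of", full)  #~233k, matching the ~230,000 above
```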

If you run the training script again, you should end up with a validation accuracy of **45.2%** in epoch 25. Despite reducing the number of parameters of our net by a great margin, we score better because we delayed overfitting.

You could try to reduce the parameter count even further: you can save a lot of weights by reducing the filter count in the last convolutional layer or the number of hidden units in the dense layers *(remember: dense layers connect every neuron with all neurons from the preceding layer, which results in a lot of weights)*. Reducing the capacity too much might result in underfitting, so be careful. Training with fewer weights might also take longer; you might need to increase the number of epochs from 30 to 50 or even 100.

Our stats now look like this:

| Run | Max. Accuracy | Best Epoch |
| --- | --- | --- |
| Randomly guessing 1 of 5 classes | ~20% | – |
| Baseline | 44.0% | 28 |
| Half filters, half hidden units | 45.2% | 25 |

*(Note: There is a distinct difference between overfitting and underfitting. Overfitting means our model has too much capacity for the dataset, so it learns features that are not useful for the classification; training and validation loss differ by a great margin. Underfitting, on the other hand, indicates that our model does not have enough parameters to learn the dataset appropriately, and both losses flatten out.)*

**Learning Rate Schedule**

The learning rate is arguably the single most important hyper-parameter of our model (hyper-parameters are settings we choose, like batch size, learning rate or the number of epochs to train). As of now, we use a fixed learning rate for the entire training process. It would be better to decrease the learning rate after every epoch to help the optimization process converge. There are different ways to do this. A common practice is to use learning rate steps, which lower the learning rate by a specific amount at certain key points during training (e.g. after epoch 10, 30 and 50). We could also interpolate the learning rate between the first and last epoch, which is what I did for this tutorial: I wanted the learning rate to be 0.0005 at first and 0.000001 after 20 epochs.

Implementing a dynamic learning rate in Lasagne is a bit complicated. Basically, we need to do five things:

1. Define a tensor variable, which stores our learning rate

```python
lr_dynamic = T.scalar(name='learning_rate')
```

2. Pass this tensor variable to the optimizer

```python
param_updates = updates.adam(loss, params, learning_rate=lr_dynamic)
```

3. Define this tensor variable as input for the training function

```python
train_net = theano.function([layers.get_all_layers(NET)[0].input_var, targets, lr_dynamic], loss, updates=param_updates)
```

4. Interpolate the learning rate before each epoch

```python
learning_rate = LR_START - (epoch - 1) * ((LR_START - LR_END) / (EPOCHS - 1))
```

5. Pass the learning rate to the training function

```python
l = train_net(image_batch, target_batch, learning_rate)
```

*( Note: I defined LR_START, LR_END and EPOCHS in the “Config” section at the top of this script.)*
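The interpolation from step 4 is easy to sanity-check in isolation. With LR_START = 0.0005, LR_END = 0.000001 and EPOCHS = 20, it reproduces the per-epoch rates printed in the training log:

```python
LR_START, LR_END, EPOCHS = 0.0005, 0.000001, 20

def lr_for_epoch(epoch):
    #linear interpolation: LR_START at epoch 1, LR_END at epoch EPOCHS
    return LR_START - (epoch - 1) * ((LR_START - LR_END) / (EPOCHS - 1))

print(lr_for_epoch(1))   # 0.0005
print(lr_for_epoch(2))   # ~0.000473736842105, as in the log below
print(lr_for_epoch(20))  # ~1e-06
```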

We fixed random seeds, added confusion matrix functionality, adapted the amount of parameters in our model and implemented a learning rate schedule. The training process should now look like this:

```
EPOCH: 1 TRAIN LOSS: 1.63724 VAL LOSS: 1.51211 VAL ACCURACY: 33.5 % TIME: 4.0 s LEARNING RATE: 0.0005
EPOCH: 2 TRAIN LOSS: 1.44772 VAL LOSS: 1.47047 VAL ACCURACY: 38.1 % TIME: 4.5 s LEARNING RATE: 0.000473736842105
EPOCH: 3 TRAIN LOSS: 1.37498 VAL LOSS: 1.41601 VAL ACCURACY: 39.7 % TIME: 4.6 s LEARNING RATE: 0.000447473684211
EPOCH: 4 TRAIN LOSS: 1.31351 VAL LOSS: 1.39896 VAL ACCURACY: 41.6 % TIME: 4.6 s LEARNING RATE: 0.000421210526316
EPOCH: 5 TRAIN LOSS: 1.25273 VAL LOSS: 1.4042 VAL ACCURACY: 41.9 % TIME: 4.6 s LEARNING RATE: 0.000394947368421
EPOCH: 6 TRAIN LOSS: 1.19128 VAL LOSS: 1.37452 VAL ACCURACY: 44.4 % TIME: 4.6 s LEARNING RATE: 0.000368684210526
EPOCH: 7 TRAIN LOSS: 1.12561 VAL LOSS: 1.34629 VAL ACCURACY: 44.1 % TIME: 4.0 s LEARNING RATE: 0.000342421052632
EPOCH: 8 TRAIN LOSS: 1.05825 VAL LOSS: 1.33971 VAL ACCURACY: 47.6 % TIME: 4.0 s LEARNING RATE: 0.000316157894737
EPOCH: 9 TRAIN LOSS: 0.991669 VAL LOSS: 1.32198 VAL ACCURACY: 48.9 % TIME: 4.0 s LEARNING RATE: 0.000289894736842
EPOCH: 10 TRAIN LOSS: 0.912119 VAL LOSS: 1.3275 VAL ACCURACY: 49.5 % TIME: 4.0 s LEARNING RATE: 0.000263631578947
EPOCH: 11 TRAIN LOSS: 0.830577 VAL LOSS: 1.36689 VAL ACCURACY: 47.4 % TIME: 4.5 s LEARNING RATE: 0.000237368421053
EPOCH: 12 TRAIN LOSS: 0.761217 VAL LOSS: 1.35419 VAL ACCURACY: 46.1 % TIME: 4.7 s LEARNING RATE: 0.000211105263158
EPOCH: 13 TRAIN LOSS: 0.692618 VAL LOSS: 1.3444 VAL ACCURACY: 46.0 % TIME: 4.6 s LEARNING RATE: 0.000184842105263
EPOCH: 14 TRAIN LOSS: 0.624378 VAL LOSS: 1.36508 VAL ACCURACY: 45.5 % TIME: 4.6 s LEARNING RATE: 0.000158578947368
EPOCH: 15 TRAIN LOSS: 0.568079 VAL LOSS: 1.3632 VAL ACCURACY: 45.8 % TIME: 4.6 s LEARNING RATE: 0.000132315789474
EPOCH: 16 TRAIN LOSS: 0.521576 VAL LOSS: 1.37423 VAL ACCURACY: 46.3 % TIME: 4.5 s LEARNING RATE: 0.000106052631579
EPOCH: 17 TRAIN LOSS: 0.486885 VAL LOSS: 1.38195 VAL ACCURACY: 45.7 % TIME: 4.6 s LEARNING RATE: 7.97894736842e-05
EPOCH: 18 TRAIN LOSS: 0.463393 VAL LOSS: 1.37417 VAL ACCURACY: 45.7 % TIME: 4.6 s LEARNING RATE: 5.35263157895e-05
EPOCH: 19 TRAIN LOSS: 0.446437 VAL LOSS: 1.37335 VAL ACCURACY: 46.2 % TIME: 4.1 s LEARNING RATE: 2.72631578947e-05
EPOCH: 20 TRAIN LOSS: 0.437487 VAL LOSS: 1.37327 VAL ACCURACY: 45.9 % TIME: 4.0 s LEARNING RATE: 1e-06
TRAINING DONE!
BEST VAL ACCURACY: 49.5 % EPOCH: 10
```

We now reached 49.5% validation accuracy, which is quite an improvement with the same model design. Additionally, we reached the best result after only 10 epochs. This is very important for training on larger datasets: adjusting the learning rate can speed up the entire training process by a great margin.

Our new result table looks like this:

| Run | Max. Accuracy | Best Epoch |
| --- | --- | --- |
| Randomly guessing 1 of 5 classes | ~20% | – |
| Baseline | 44.0% | 28 |
| Half filters, half hidden units | 45.2% | 25 |
| Learning rate schedule | 49.5% | 10 |

We now have everything in place to investigate more techniques for optimization. We could try a number of things for baseline optimization; feel free to test different settings for batch size, model parameters, maybe even image size. I will introduce some ways of regularization, initialization and activation, as well as different model architectures, in future tutorials. Eventually, we will try to improve upon our experiments and come up with the best possible training scheme for our current task.