Essential techniques for systematically improving model performance
Machine learning has become an indispensable tool in numerous industries, from finance to healthcare to retail. However, as the complexity of the data and the models involved increases, so too does the challenge of achieving optimal performance. Fortunately, there are a variety of essential techniques that can be employed to systematically improve model performance. In this article, we will explore some of these techniques, from hyperparameter tuning to TPU training, and provide practical tips for implementing them effectively. Whether you are a seasoned machine learning practitioner or just starting out, this guide will help you take your models to the next level.
When it comes to improving model performance, in my opinion we need two things: a strategy and speed of execution. We need a plan, and we need to execute that plan fast. Executing experiments quickly lets you learn faster: you learn from past experiments, get a little better in the next one, and the cycle goes on and on.
Therefore, in addition to finding the best hyperparameters and architectures, we will explain how to speed up and scale up model training through multi-GPU and TPU training, mixed precision, and remote computing resources in the cloud, so that experimentation stays efficient. The goal is to move beyond satisfactory results and achieve outstanding performance, up to winning machine learning competitions, by applying the essential techniques behind state-of-the-art deep learning models.
PART I — OPTIMIZATION
Hyperparameter optimization
When creating a deep learning model, there are numerous architecture-level parameters, known as hyperparameters, that need to be chosen: the number of layers, the number of units per layer, the activation functions, and so on. These decisions are difficult to make and require a lot of trial and error, even for experts. It is therefore essential to explore the space of possible decisions in a systematic, principled way to identify the best-performing architectures. Automatic hyperparameter optimization is the field of research dedicated to exactly this process: using search techniques to explore the space of architecture decisions and find the hyperparameters that perform best.
Updating the weights of a deep learning model is relatively simple: compute the loss on a mini-batch of data, then use backpropagation to adjust the weights. Updating hyperparameters, by contrast, presents unique challenges. The hyperparameter space usually consists of discrete decisions and is therefore neither continuous nor differentiable, so gradient descent cannot be applied and gradient-free optimization techniques have to be used instead. Additionally, computing the feedback signal for the optimization process can be costly and noisy: finding out whether a given set of hyperparameters leads to a high-performing model requires training that model from scratch. Fortunately, there is a tool called KerasTuner that simplifies the process of hyperparameter tuning.
If you want to adopt a modular and configurable approach to model building, you can subclass the HyperModel class and define a build method, as follows.
import kerastuner as kt
from tensorflow import keras
from tensorflow.keras import layers

class SimpleMLP(kt.HyperModel):
    def __init__(self, num_classes):
        self.num_classes = num_classes

    def build(self, hp):
        # Sample hyperparameters from the search space.
        units = hp.Int(name="units", min_value=16, max_value=64, step=16)
        model = keras.Sequential([
            layers.Dense(units, activation="relu"),
            layers.Dense(self.num_classes, activation="softmax")
        ])
        optimizer = hp.Choice(name="optimizer", values=["rmsprop", "adam"])
        model.compile(
            optimizer=optimizer,
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
        return model

hypermodel = SimpleMLP(num_classes=10)
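Once the hypermodel is defined, you hand it to a tuner and launch the search. The snippet below is a minimal sketch: the x_train and y_train arrays, the trial budget, and the directory name are placeholders you would adapt to your own problem.

# Sketch: running the search on placeholder data (e.g. flattened image vectors).
tuner = kt.BayesianOptimization(
    hypermodel,
    objective="val_accuracy",   # signal used to compare trials
    max_trials=20,              # number of hyperparameter configurations to try
    directory="hpo_results",
    overwrite=True)
tuner.search(
    x_train, y_train,
    validation_split=0.2,       # held-out data that produces the tuning signal
    epochs=30,
    callbacks=[keras.callbacks.EarlyStopping(patience=3)])
best_hps = tuner.get_best_hyperparameters(1)[0]
print(best_hps.get("units"), best_hps.get("optimizer"))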
When performing automatic hyperparameter optimization on a large scale, it is crucial to consider the problem of validation set overfitting. This is because the hyperparameters are updated based on a signal computed using the validation data, which means they are being trained on this data and may quickly overfit to it. It is important to be aware of this issue and take steps to prevent it.
The art of crafting the right search space
It is important to carefully design the search space when using hyperparameter tuning. Although it is a way to automate experiments that would otherwise be done manually, it is still necessary to choose configurations that have the potential to produce good results. KerasTuner offers pre-made search spaces that are relevant to different types of problems, such as image classification. By using these pre-made spaces and adding your data, you can run the search and get a decent model. Hypermodels like kt.applications.HyperXception and kt.applications.HyperResNet are adjustable versions of Keras Applications models that you can try.
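As a rough sketch of how that looks (assuming CIFAR-10-shaped inputs and the trial budget shown here), you plug a pre-made hypermodel into a tuner just like a custom one:

# Sketch: tuning a pre-made image-classification search space.
hypermodel = kt.applications.HyperResNet(input_shape=(32, 32, 3), classes=10)
tuner = kt.RandomSearch(
    hypermodel,
    objective="val_accuracy",
    max_trials=10,
    directory="hyper_resnet",
    overwrite=True)
# tuner.search(x_train, y_train, validation_split=0.2, epochs=20)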
Model ensembling
Model ensembling is a powerful technique for achieving optimal results on a given task. The idea behind ensembling is that several high-performing models, trained independently, are likely to be successful for different reasons. This is because each model focuses on slightly different aspects of the data to make predictions, and therefore captures part of the “truth,” but not all of it. The key to successful ensembling lies in the diversity of the set of classifiers used. If all the models are biased in the same way, the ensemble will also be biased in the same way. However, if the models are biased in different ways, the biases will offset each other, resulting in a more robust and accurate ensemble. Thus, it is important to ensemble models that are both highly effective and as dissimilar as possible.
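The simplest form of ensembling is to average, or weighted-average, the predictions of several models at inference time. Below is a minimal sketch; model_a, model_b, model_c, x_val, and the weights are hypothetical, and in practice you would pick the weights empirically on validation data.

import numpy as np

# Hypothetical models trained independently on the same task.
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)

# Weighted average of the class probabilities; the weights (assumed here)
# give more influence to the stronger and more "different" models.
ensemble_preds = 0.5 * preds_a + 0.3 * preds_b + 0.2 * preds_c
ensemble_labels = np.argmax(ensemble_preds, axis=1)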
PART II — SCALING UP MODEL TRAINING
The machine learning cycle involves coming up with an idea, implementing it in a deep learning framework, examining the results, and repeating the process. Once you become skilled in using the Keras API, the speed at which you can create your deep learning experiments will no longer be the bottleneck of the progress cycle. Instead, the next bottleneck will be the speed at which you can train your models. With fast training infrastructure, you can obtain results within 10 to 15 minutes, allowing you to go through numerous iterations each day. Quicker training enhances the quality of your deep learning solutions.
This section discusses three ways to train your models faster.
- Mixed-precision training, which you can use even with a single GPU
- Training on multiple GPUs
- Training on TPUs
MIXED-PRECISION TRAINING
Mixed-precision training can accelerate the training of nearly any model by up to three times at no extra cost. This might sound too good to be true, but you can enable it in a few simple steps and verify the speedup for yourself. I won't go into the details of how it works, but here's how to activate it when using a GPU:
from tensorflow import keras
keras.mixed_precision.set_global_policy("mixed_float16")
In general, when using mixed-precision training, the majority of the forward pass of the model will use float16 data type, except for operations that are numerically unstable, such as softmax.
The model's weights will be stored and updated in float32. If you want to exclude a specific layer from mixed precision, you can pass dtype="float32" to its constructor. Keep in mind that this technique calls for some caution and may not be appropriate for every production-grade architecture, but when prototyping it can significantly reduce training time.
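For example, here is a minimal sketch that keeps a numerically sensitive output layer in full precision while the rest of the model runs under the mixed_float16 policy:

from tensorflow import keras
from tensorflow.keras import layers

keras.mixed_precision.set_global_policy("mixed_float16")

inputs = keras.Input(shape=(784,))
x = layers.Dense(256, activation="relu")(inputs)  # runs in float16 under the policy
# Force the softmax output layer to stay in float32 for numerical stability.
outputs = layers.Dense(10, activation="softmax", dtype="float32")(x)
model = keras.Model(inputs, outputs)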
MULTI-GPU TRAINING
Once you’re able to import tensorflow on a machine with multiple GPUs, you’re seconds away from training a distributed model. It works like this:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")
with strategy.scope():
    # get_compiled_model() stands for your own model-building function;
    # creating and compiling the model inside the scope places it on all replicas.
    model = get_compiled_model()
model.fit(
    train_dataset,
    epochs=100,
    validation_data=val_dataset,
    callbacks=callbacks)
These lines of code implement a common training setup called single-host, multi-device synchronous training or mirrored distribution strategy in TensorFlow. In this setup, multiple GPUs are present on a single machine and the training process is synchronous, meaning that the state of each GPU’s model replicas remains the same at all times. It’s worth noting that there are other variants of distributed training that don’t maintain this synchronicity.
To achieve the best performance when doing distributed training, you should provide your data as a tf.data.Dataset object. Data prefetching should also be utilised by calling dataset.prefetch(buffer_size) before passing the dataset to fit(). If you’re unsure what buffer size to use, the option dataset.prefetch(tf.data.AUTOTUNE) can automatically select a buffer size for you.
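As a minimal sketch (the x_train and y_train arrays and the batch size are placeholders), a typical input pipeline for distributed training could look like this:

import tensorflow as tf

global_batch_size = 256  # split evenly across the replicas by MirroredStrategy

train_dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10_000)
    .batch(global_batch_size)
    .prefetch(tf.data.AUTOTUNE))  # overlap data preparation with GPU training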
Although it would be ideal for training on N GPUs to result in an N-fold speedup, there is some overhead involved in distribution, particularly in merging the weight deltas from different devices. Therefore, the actual speedup achieved is dependent on the number of GPUs used:
- With two GPUs, the speedup stays close to 2x.
- With four, the speedup is around 3.8x.
- With eight, it’s around 7.3x.
This assumes that you’re using a large enough global batch size to keep each GPU utilised at full capacity. If your batch size is too small, the local batch size won’t be enough to keep your GPUs busy.
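A common rule of thumb is to pick a per-replica batch size that saturates a single GPU and scale the global batch size by the number of replicas; the values below are assumptions to adapt to your own model and GPU memory:

# With MirroredStrategy, fit() splits each global batch evenly across replicas.
per_replica_batch_size = 128  # assumed value; tune for your model and GPU memory
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync
print("Global batch size:", global_batch_size)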
TPU TRAINING
You can actually use an 8-core TPU for free in Colab. In the Colab menu, under the Runtime tab, in the Change Runtime Type option, you’ll notice that you have access to a TPU runtime in addition to the GPU runtime.
When you’re using the GPU runtime, your models have direct access to the GPU without you needing to do anything special. This isn’t true for the TPU runtime; there’s an extra step you need to take before you can start building a model: you need to connect to the TPU cluster.
It works like this:
import tensorflow as tf
tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
print("Device:", tpu.master())

from tensorflow import keras
from tensorflow.keras import layers
strategy = tf.distribute.TPUStrategy(tpu)
print(f"Number of replicas: {strategy.num_replicas_in_sync}")

def build_model(input_size):
    inputs = keras.Input((input_size, input_size, 3))
    x = keras.applications.resnet.preprocess_input(inputs)
    x = keras.applications.resnet.ResNet50(
        weights=None, include_top=False, pooling="max")(x)
    outputs = layers.Dense(10, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

with strategy.scope():
    model = build_model(input_size=32)
Beware of I/O bottlenecks
To prevent data-reading speed from becoming a bottleneck when using TPUs, keep small datasets entirely in the memory of the VM by caching them with dataset.cache(). If the dataset is too large to fit in memory, store it as TFRecord files, an efficient binary format that can be loaded quickly; the code example at https://keras.io/examples/keras_recipes/creating_tfrecords/ on keras.io demonstrates how to format data as TFRecord files.
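For a dataset that fits in host memory, the caching step is a one-liner (a sketch; train_dataset stands for whatever tf.data.Dataset you have built):

# Keep the dataset in memory after the first pass so the TPU
# never waits on repeated reads from disk or the network.
train_dataset = train_dataset.cache().prefetch(tf.data.AUTOTUNE)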
Another technique to improve TPU utilisation is step fusing, which involves performing multiple training steps during each TPU execution step. The steps_per_execution argument in compile() can be used to specify the number of training steps to execute during each TPU execution step. For small models that are not fully utilizing the TPU, this technique can significantly speed up training.
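Here is what that looks like in compile(); the value 8 is an arbitrary example you would tune for your own model:

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"],
              # Run 8 training steps per TPU execution to cut host/device round-trips.
              steps_per_execution=8)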
Key takeaways
- Leverage hyperparameter tuning to automate the tedious search for the best model configuration, and be mindful of validation-set overfitting.
- An ensemble of diverse models can often significantly improve the quality of your predictions.
- Turn on mixed-precision training to speed up training on a GPU.
- Scale your workflows with the tf.distribute.MirroredStrategy API to train models on multiple GPUs.
- Train for free on a Google Colab TPU with TPUStrategy, and use step fusing to keep the TPU fully utilised.
I hope this content will inspire you to try out some new things!
Thank you!
ATTRIBUTIONS
Deep Learning with Python, second edition — Francois Chollet
The Keras source code can be found at https://github.com/keras-team/keras.