Classifier Recipe Step One: Training, Evaluating, and Exporting a Model

The default Classifier configuration supports a large number of classifier models that have been qualified to ensure that they compress and compile for common hardware targets. We will walk through an example where we train and evaluate some of these models on a small Open Images dataset. This dataset will enable us to quickly demonstrate each step of the process and give us a decent baseline accuracy for verification purposes.

Later in the tutorial we will provide instructions on adding your own dataset and how to adjust the model and training parameters so that you can adapt the recipe to your needs. We will start with 224x224 image datasets because this is a common configuration that works across all of the currently supported backbones.

Download the Dataset

To get started, download and install the open-images-10-classes dataset:

CODE
cd /latentai
leip zoo download --dataset_id open-images-10-classes --variant_id eval
leip zoo download --dataset_id open-images-10-classes --variant_id train
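
# Optional: locate where the dataset variants were stored. This is a generic
# filesystem search, not part of the LEIP tooling; the exact download path
# may vary between releases.
find /latentai -maxdepth 4 -type d -iname "*open-images*" 2>/dev/null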

Train a Model

The machine learning component of a recipe is defined by a configuration file. Additional details of the file will be discussed later, but for now note that the default file we will use is classifier-recipe.yaml and it will be passed to the LEIP Machine Learning Applications Framework tool using the --config-name classifier-recipe command line option.

We will start by training one of the smaller classifier models, timm:gernet_m (see also the list of all classifier models currently supported by LEIP Recipes). We will pass the backbone with the model.module.backbone command line option to train this model. The open-images-10-classes dataset is already set as the default in the classifier-recipe. Use the following command to start the training process:

CODE
# Example training command for timm:gernet_m, with the default classifier dataset:
# af --config-name classifier-recipe model.module.backbone="timm:gernet_m"
#
# The above command will by default store a checkpoint in a time-stamped directory:
# /latentai/artifacts/train/<timestamp>_task_leip_classifier/<checkpoint file>
#
# For this tutorial, we will pass a hard-coded checkpoint filename to simplify
# locating the checkpoint in the later steps.  We can do this by passing the
# callbacks.checkpoint.filename command line option

# Use the following to train the timm:gernet_m backbone classifier:

af --config-name classifier-recipe model.module.backbone="timm:gernet_m" \
  callbacks.checkpoint.filename=/latentai/checkpoint/timm_gernet_m \
  command=train
  
# The checkpoint file will be the filename passed to the callback above with ".ckpt" appended.
# Set an env variable to simplify the following steps:

export CHECKPOINT=/latentai/checkpoint/timm_gernet_m.ckpt
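
# Optional sanity check: verify that training wrote the checkpoint file
ls -lh "$CHECKPOINT"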

Evaluate the Model

You can evaluate the trained model from the checkpoint by passing the checkpoint path:

CODE
af --config-name classifier-recipe \
  model.module.backbone=timm:gernet_m \
  +checkpoint=$CHECKPOINT command=evaluate

An evaluation report will be provided in:
/latentai/artifacts/evaluate/open-images-10-classes/val/metrics_report.json
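
If you want a quick look at the reported metrics, you can pretty-print the JSON report with Python's built-in json.tool module (a generic inspection step, not part of the LEIP tooling; the exact keys in the report depend on the recipe):

CODE
python3 -m json.tool /latentai/artifacts/evaluate/open-images-10-classes/val/metrics_report.json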

Visualize Predictions

You can visualize model predictions on a small set of images:

CODE
af --config-name classifier-recipe \
  model.module.backbone=timm:gernet_m   \
  +checkpoint=$CHECKPOINT command=predict

The images with superimposed labels will be in the directory:

/latentai/artifacts/predictions/open-images-10-classes/validation
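
To confirm the annotated images were generated, you can simply list that directory (standard shell; the exact file names and formats may vary):

CODE
ls /latentai/artifacts/predictions/open-images-10-classes/validation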

Export the Trained Model

Finally, you can use the export command to export the trained model:

CODE
af --config-name classifier-recipe \
  model.module.backbone=timm:gernet_m \
  +checkpoint=$CHECKPOINT command=export

The exported, trained model will be found at:

/latentai/artifacts/export/leip_classifier_timm-gernet_m_batch1_224-224/traced_model.pt
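
As a quick sanity check, you can load the traced model with PyTorch and run a random input through it. This is a generic TorchScript smoke test, not part of the LEIP workflow; it assumes PyTorch is available in the container and that the exported model expects a single 1x3x224x224 float tensor (adjust if your export differs):

CODE
python3 - <<'EOF'
import torch

# Load the traced (TorchScript) model exported above
model = torch.jit.load(
    "/latentai/artifacts/export/leip_classifier_timm-gernet_m_batch1_224-224/traced_model.pt")
model.eval()

# Assumed input layout: batch of 1, 3 channels, 224x224 pixels
dummy = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    out = model(dummy)

print("output shape:", tuple(out.shape))  # likely (1, 10) for the 10-class dataset
EOF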

Next, we will compile and optimize the traced model for evaluation on the target device. We have provided instructions for adding your own data to the recipe if you would like to retrain the model with your own data.


Troubleshooting:

If the af commands fail on a preconfigured recipe, the most likely cause is insufficient memory. In that case:

  • Ensure that you have sufficient memory available in your system.

  • Ensure that other Docker containers are not competing for resources.

  • Ensure that any GPU cards you are using are not in use by other processes.

Use the --ipc=host option when launching the Docker container (appended to the docker run command) so that the container can use the host's shared memory rather than Docker's limited default.

If your system does not have at least 32GB of RAM, follow the steps listed below to reduce the recipe's demand on your system.

Use the --gpus all option when launching the Docker container on multi-GPU machines to provide access to all of the GPUs. Before launching a command, determine which GPUs are free by using the nvidia-smi Linux command. Then use the CUDA_VISIBLE_DEVICES environment variable to expose only the free GPUs to the af command.
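
For example, you might check GPU utilization and then restrict the af command to a single free GPU (the GPU index used below is illustrative; substitute one that nvidia-smi reports as idle):

CODE
# List GPUs and their current utilization:
nvidia-smi

# Run evaluation using only GPU 0:
CUDA_VISIBLE_DEVICES=0 af --config-name classifier-recipe \
  model.module.backbone=timm:gernet_m \
  +checkpoint=$CHECKPOINT command=evaluate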

Reduce Demands on Your Host Hardware:

The task settings in the recipe are tied to your hardware. You may tweak them to better match your hardware resources.

  • task.batch_sizes The default batch size is [8,8] (8 samples during training and 8 samples during evaluation). The optimal value for this parameter depends on the amount of RAM you are able to allocate to the container. A batch size of 8 correlates with a requirement of at least 12GB of RAM allocated to the container. If allocating this amount of RAM is not possible, you may reduce the batch size (and therefore reduce RAM requirements) by appending task.batch_sizes=[4,4] to the commands above.

  • task.num_workers The default value is 4. The optimal value for this parameter is a bit trickier to determine, but a good place to start is the number of CPU cores in your machine. If your CPU has a different number of cores (for example, 8), you may override the default by appending task.num_workers=8 to the commands listed above, as shown in the example below.
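
For example, a training run with both overrides applied might look like this (the values are illustrative; tune them to your hardware):

CODE
af --config-name classifier-recipe model.module.backbone="timm:gernet_m" \
  callbacks.checkpoint.filename=/latentai/checkpoint/timm_gernet_m \
  task.batch_sizes=[4,4] \
  task.num_workers=8 \
  command=train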
