Classifier Recipe Step One: Training, Evaluating, and Exporting a Model
The default Classifier configuration supports a large number of classifier models that have been qualified to ensure that they compress and compile for common hardware targets. We will walk through an example where we train and evaluate some of these models on a small ten-class dataset drawn from Open Images. This dataset lets us quickly demonstrate each step of the process and gives us a reasonable baseline accuracy for verification purposes.
Later in the tutorial we will provide instructions on adding your own dataset and how to adjust the model and training parameters so that you can adapt the recipe to your needs. We will start with 224x224 image datasets because this is a common configuration that works across all of the currently supported backbones.
Download the Dataset
To get started, download and install the open-images-10-classes dataset:
cd /latentai
leip zoo download --dataset_id open-images-10-classes --variant_id eval
leip zoo download --dataset_id open-images-10-classes --variant_id train
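If you want to confirm that the download succeeded, you can search for the dataset on disk. The exact download location may vary between SDK releases, so the check below is deliberately generic:
# Locate the downloaded dataset inside the container (location may vary by release):
find /latentai -maxdepth 4 -type d -name "open-images-10-classes*" 2>/dev/null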
Train a Model
The machine learning component of a recipe is defined by a configuration file. Additional details of the file will be discussed later, but for now note that the default file we will use is classifier-recipe.yaml, and that it will be passed to the LEIP Machine Learning Applications Framework tool (af) using the --config-name classifier-recipe command line option.
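Any value in classifier-recipe.yaml can also be overridden on the af command line as a key=value pair; this is how the steps below select the backbone, name the checkpoint, and choose which command to run. The general shape is sketched here, and the specific keys shown all appear later in this tutorial:
# General form: select a recipe config, then override any of its values:
# af --config-name classifier-recipe <key>=<value> [<key>=<value> ...]
# For example:
# af --config-name classifier-recipe model.module.backbone="timm:gernet_m" command=train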
We will start by training one of the smaller classifier models, timm:gernet_m (see also the list of all classifier models currently supported by LEIP Recipes). We will pass this backbone using the model.module.backbone command line option. The open-images-10-classes dataset is already set as the default in the classifier recipe. Use the following command to start the training process:
# Example training command for timm:gernet_m, with the default classifier dataset:
# af --config-name classifier-recipe model.module.backbone="timm:gernet_m"
#
# The above command will by default store a checkpoint in a time-stamped directory:
# /latentai/artifacts/train/<timestamp>_task_leip_classifier/<checkpoint file>
#
# For this tutorial, we will pass a hard-coded checkpoint filename to simplify
# locating the checkpoint in the later steps. We can do this by passing the
# callbacks.checkpoint.filename command line option
# Use the following to train the timm:gernet_m backbone classifier:
af --config-name classifier-recipe model.module.backbone="timm:gernet_m" \
callbacks.checkpoint.filename=/latentai/checkpoint/timm_gernet_m \
command=train
# The checkpoint file will be the filename given above with ".ckpt" appended.
# Set an env variable to simplify the following steps:
export CHECKPOINT=/latentai/checkpoint/timm_gernet_m.ckpt
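Before moving on, it is worth confirming that training produced the checkpoint file at the expected location:
ls -lh $CHECKPOINT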
Evaluate the Model
You can evaluate the trained model from the checkpoint by passing the checkpoint path:
af --config-name classifier-recipe \
model.module.backbone=timm:gernet_m \
+checkpoint=$CHECKPOINT command=evaluate
An evaluation report will be provided in: /latentai/artifacts/evaluate/open-images-10-classes/val/metrics_report.json
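The report is a JSON file; if you want to inspect the metrics from inside the container, pretty-printing it is enough (this assumes python3 is on the container's path, which it normally is since the training tooling is Python-based):
python3 -m json.tool /latentai/artifacts/evaluate/open-images-10-classes/val/metrics_report.json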
Visualize Predictions
You can visualize model predictions on a small set of images:
af --config-name classifier-recipe \
model.module.backbone=timm:gernet_m \
+checkpoint=$CHECKPOINT command=predict
The images with superimposed labels will be in the directory:
/latentai/artifacts/predictions/open-images-10-classes/validation
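To browse what was produced, list that directory; you can then copy individual images out of the container to view them locally:
ls /latentai/artifacts/predictions/open-images-10-classes/validation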
Export the Trained Model
Finally, you can use the export command to export the trained model:
af --config-name classifier-recipe \
model.module.backbone=timm:gernet_m \
+checkpoint=$CHECKPOINT command=export
The exported, trained model will be found at:
/latentai/artifacts/export/leip_classifier_timm-gernet_m_batch1_224-224/traced_model.pt
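If you would like a quick sanity check that the traced model loads and runs before compiling it, the sketch below loads it with TorchScript and pushes a random tensor through it. The 1x3x224x224 input shape is inferred from the export name (batch1_224-224); no normalization is applied, so only the fact that the forward pass succeeds is meaningful. If the model was traced on a GPU, you may need to move the model and input to CUDA first.
python3 - <<'EOF'
import torch

# Path produced by the export step above.
model = torch.jit.load(
    "/latentai/artifacts/export/leip_classifier_timm-gernet_m_batch1_224-224/traced_model.pt"
)
model.eval()

# Random input matching the exported batch size and resolution (1x3x224x224 assumed).
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = model(dummy)
print("output shape:", tuple(out.shape))
EOF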
Next, we will compile and optimize the traced model for evaluation on the target device. We have also provided instructions for adding your own dataset to the recipe if you would like to retrain the model on your own data.
Troubleshooting:
If the af commands fail on a preconfigured recipe, the cause is most likely insufficient memory. In that case:
Ensure that you have sufficient memory available in your system.
Ensure that other Docker containers are not competing for resources.
Ensure that any GPU cards you are using are not in use by other processes.
Use the --ipc=host option when launching the Docker container (appended to the docker run command) to allow the container to use the maximum amount of RAM; an example appears after this list.
If your system does not have at least 32GB of RAM, follow the steps listed below to reduce the recipe's demand on your system.
Use the --gpus all option when launching the Docker container on multi-GPU machines to provide access to all of the GPUs. Before launching a command, determine which GPUs are free by using the nvidia-smi Linux command, then use the CUDA_VISIBLE_DEVICES environment variable to expose only the free GPUs to the af command.
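As a concrete sketch of the Docker and GPU suggestions above (the image name and the GPU index below are placeholders for your own setup):
# Launch the container with shared host IPC and all GPUs visible
# (<leip-sdk-image> is a placeholder for the image you normally run):
docker run -it --gpus all --ipc=host <leip-sdk-image>

# Inside the container, check which GPUs are idle:
nvidia-smi

# Then pin the af command to a free GPU (GPU 1 here is only an example):
CUDA_VISIBLE_DEVICES=1 af --config-name classifier-recipe \
model.module.backbone="timm:gernet_m" \
callbacks.checkpoint.filename=/latentai/checkpoint/timm_gernet_m \
command=train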
Reduce Demands on Your Host Hardware:
The task settings in the recipe are tied to your hardware resources. You may tweak them to better match what your system can provide.
task.batch_sizes
The default batch size is [8,8] (8 samples during training and 8 samples during evaluation). The optimal value for this parameter depends on the amount of RAM you are able to allocate to the container. A batch size of 8 requires at least 12GB of RAM allocated to the container. If allocating this amount of RAM is not possible, you may reduce the batch size (and therefore the RAM requirement) by appending task.batch_sizes=[4,4] to the commands above.
task.num_workers
The default value is 4. The optimal value for this parameter is a bit trickier to determine, but a good starting point is the number of CPU cores in your machine. If your CPU has a different number of cores (for example, 8), you may override the default by appending task.num_workers=8 to the commands listed above.
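For example, a training run on a memory-constrained machine with 8 CPU cores might combine both overrides:
af --config-name classifier-recipe \
model.module.backbone="timm:gernet_m" \
callbacks.checkpoint.filename=/latentai/checkpoint/timm_gernet_m \
command=train \
task.batch_sizes=[4,4] \
task.num_workers=8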