Pick Multiple Models, Train, and Evaluate to Choose the Best Candidate¶
In this guide, we'll select and train multiple models, evaluate the results, and choose the best recipe among all candidates that meet our performance requirements. Let's set up the Golden Recipe DataBase (GRDB) API to work with different recipes. We start with the carsimple volume, which contains road scenes. This volume provides a context for integrating road sign data, as these signs are crucial for understanding traffic environments and for improving models on tasks like sign detection and vehicle navigation.
import leip_recipe_designer as rd
from pathlib import Path
import shutil
import requests
import os
import logging
logger = logging.getLogger('leip_recipe_designer.tasks')
logger.setLevel(logging.CRITICAL)
workspace = Path('./workspace')
pantry = rd.Pantry.build(workspace / "my_combined_pantry", force_rebuild=False)
data = rd.helpers.data.new_pascal_data_generator(
pantry=pantry,
root_path="${paths.cache_dir}/road-sign-data",
images_dir="images",
annotations_dir="annotations",
nclasses=4,
is_split=False,
trainval_split_ratio=0.80,
trainval_split_seed=42,
dataset_name="road-sign-data",
download_url="https://s3.us-west-1.amazonaws.com/leip-showcase.latentai.io/recipes/andrewmvd_road-sign-detection.zip"
)
volumes = rd.GoldenVolumes().list_volumes_from_zoo()
df = volumes["carsimple"].get_golden_df()
2024-10-29 14:24:22,223 | WARNING | pantry.build-119 | You requested to build a Pantry, but haven't specified the desired execution contexts. Therefore, will use the installed ones ['leip_af', 'leip_forge', 'leip_stub_gen']
Skipped downloading goldenrecipedb with name "carsimple" and variant "Carsimplev0.3" (0), as it already exists.
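If you'd like to see what the golden DataFrame records before filtering it, you can inspect it as you would any pandas DataFrame. Only the id and sppr columns are relied on later in this tutorial; the rest of the schema depends on your GRDB version:
# Inspect the golden DataFrame; only "id" and "sppr" are used below.
# The remaining columns vary with the GRDB version.
print(df.columns.tolist())
print(df.head())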
Step 1: Retrieve Candidate Recipes¶
To save time, we'll limit our selection to the best five models that meet our task performance criteria and have the lowest multiply-accumulate (MAC) operations.
First, let's retrieve the Serialized Packaged Portable Recipe (SPPR) for each of the candidate models using the from_sppr
method. After deserializing, all we need to do is swap the data ingredient for each recipe with our own dataset.
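The five IDs hardcoded below were shortlisted ahead of time from the golden DataFrame. As a rough, hypothetical sketch of how such a shortlist could be built (the "metric" and "macs" column names are assumptions, not the actual schema; check df.columns for the real names in your GRDB version):
# Hypothetical shortlist query: keep recipes whose task metric clears a
# threshold, then take the five with the fewest MACs. The "metric" and
# "macs" column names are placeholders for whatever your golden
# DataFrame actually exposes; the 0.5 threshold is illustrative.
shortlist = df[df["metric"] >= 0.5].nsmallest(5, "macs")
print(shortlist["id"].tolist())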
optimal_models = [122826, 120659, 122558, 119323, 119963]
# Extract the recipes
candidate_recipes = {}
for recipe_id in optimal_models:
sppr = df.query(f"id == {recipe_id}").iloc[0]["sppr"]
candidate_recipes[recipe_id] = rd.create.from_sppr(sppr, pantry, allow_upgrade=True)
print(f"We have collected the {len(candidate_recipes)} recipes with lowest MACs that meet our task metric performance criteria")
We have collected the 5 recipes with lowest MACs that meet our task metric performance criteria
Step 2: Train the Candidate Recipes¶
Tip: no need to wait for full training!
Training these recipes until convergence will take a while: an NVIDIA RTX A4500 GPU took about 50 minutes to train these 5 models. We understand you may want to continue with the tutorial without waiting for convergence, so we've added a few lines below to limit the training time to just one-tenth of an epoch.
After this short training showcase, you can use a few lines of code to download the trained checkpoints and proceed to the next step of evaluating the recipe.
If you prefer to wait for the full training to converge instead of downloading the checkpoints, remove or comment out the following lines:
recipe["train.num_epochs"] = 1
recipe["trainer.train_batches_percentage"] = 0.1
train_outputs = {}
for recipe_id, recipe in candidate_recipes.items():
# It is necessary to add a logger to our run. Below we are adding a default local log.
# If you use a different logger, such as Weights and Biases or Neptune,
# visit our documentation for instructions on how to add it to the recipe
recipe.assign_ingredients('loggers', {"my_local_training_log": "Tensorboard"})
# In-place change in the recipe. Swaps the recipe's original training data with your data
rd.helpers.data.replace_data_generator(recipe, data)
    # THE FOLLOWING TWO LINES CUT TRAINING SHORT, FOR THE SAKE OF TIME. IF YOU HAVE TIME TO WAIT
    # UNTIL CONVERGENCE, COMMENT THEM OUT AND TRAINING WILL STOP AUTOMATICALLY ONCE IT'S DONE
recipe["train.num_epochs"] = 1
recipe["trainer.train_batches_percentage"] = 0.1
# This is a completely optional step. Since we are training multiple
# recipes, we will use the recipe ID to identify the artifacts generated by
# this recipe
recipe["experiment.name"] = f"{recipe_id}"
# Ensure all slots of the recipe are filled
recipe.fill_empty_recursively()
print(f"\n\nTraining recipe {recipe_id}")
train_output = rd.tasks.train(recipe)
train_outputs[recipe_id] = train_output
# After training is finished for a recipe, add the checkpoint path to the
# recipe, so it can be used by the evaluate task below
recipe.assign_ingredients("checkpoint", "Local ckpt file")
recipe['checkpoint.path'] = str(train_output["best_model_path"])
Training recipe 122826
Executing AF command: af --config-dir /tmp/tmplxp86xlm --config-name recipe.yaml hydra.job_logging.root.level=50 +command=train
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
model size is 1.0x
init weights...
=> loading pretrained model https://download.pytorch.org/models/shufflenetv2_x1-5666bf0f80.pth
Finish initialize NanoDet Head.

Training recipe 120659
Executing AF command: af --config-dir /tmp/tmpiw8wc6il --config-name recipe.yaml hydra.job_logging.root.level=50 +command=train
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(

Training recipe 122558
Executing AF command: af --config-dir /tmp/tmpnlel67hk --config-name recipe.yaml hydra.job_logging.root.level=50 +command=train
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(

Training recipe 119323
Executing AF command: af --config-dir /tmp/tmpb5weq2mv --config-name recipe.yaml hydra.job_logging.root.level=50 +command=train
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(

Training recipe 119963
Executing AF command: af --config-dir /tmp/tmpkgx8bxfp --config-name recipe.yaml hydra.job_logging.root.level=50 +command=train
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Step 3: Download the Checkpoints¶
After waiting for the short training to complete above, you will have training outputs. As mentioned earlier in this tutorial, all tasks return a dictionary pointing to any outputs generated from the task.
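For example, you can list the best checkpoint recorded for each recipe, using the same best_model_path key we read when wiring the checkpoints into the recipes above:
# Print the best checkpoint artifact produced for each trained recipe
for recipe_id, outputs in train_outputs.items():
    print(recipe_id, "->", outputs["best_model_path"])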
Unless you commented out the specified lines, you only trained for 10% of an epoch, so we don't expect your model to have learned much. The cells below will download and extract the checkpoints that were trained on Latent AI servers, and modify your training outputs to point to the downloaded checkpoints instead.
# URL of the archive containing the checkpoints trained on Latent AI servers
file_url = "https://s3.us-west-1.amazonaws.com/leip-showcase.latentai.io/recipes/tutorials/Design_Models.zip"
# Specify the local directory to save the downloaded and extracted files
local_directory = Path("downloaded_checkpoints")
# Create the local directory if it doesn't exist
local_directory.mkdir(exist_ok=True, parents=True)
# Path to save the downloaded file
zip_file_path = local_directory / "Design_Models.zip"
# Download the file, raising an error if the request fails instead of
# silently skipping the download
response = requests.get(file_url, stream=True)
response.raise_for_status()
with open(zip_file_path, "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
# Extract the contents of the zip file
shutil.unpack_archive(zip_file_path, local_directory)
print("Downloaded and extracted files to:", local_directory)
Downloaded and extracted files to: downloaded_checkpoints
The cell below replaces the trained checkpoints with the downloaded ones.
file_paths = []
for root, dirs, files in os.walk(local_directory):
for filename in files:
file_path = os.path.abspath(os.path.join(root, filename))
file_paths.append(file_path)
for recipe_id, recipe in candidate_recipes.items():
for file_path in file_paths:
if f"{recipe_id}" in file_path and "ckpt" in file_path:
recipe["model.checkpoint"] = file_path
Step 4: Evaluate the Models, Visualize the Predictions, and Pick a Winner¶
Evaluate the models trained using the candidate recipes, and select the best performer.
evaluate_outputs = {}
for recipe_id, recipe in candidate_recipes.items():
eval_output = rd.tasks.evaluate(recipe)
evaluate_outputs[recipe_id] = eval_output
for recipe_id, scores in evaluate_outputs.items():
print("Recipe with ID", recipe_id, "has a Mean Average Precision score (averaged over IoU Thresholds 0.50:.95:0.05) of", scores["evaluate.metric_single"])
Executing AF command: af --config-dir /tmp/tmp9nou8r8c --config-name recipe.yaml hydra.job_logging.root.level=50 +command=evaluate
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
model size is 1.0x
init weights...
=> loading pretrained model https://download.pytorch.org/models/shufflenetv2_x1-5666bf0f80.pth
Finish initialize NanoDet Head.
Executing AF command: af --config-dir /tmp/tmpy44qz5uy --config-name recipe.yaml hydra.job_logging.root.level=50 +command=evaluate
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Executing AF command: af --config-dir /tmp/tmp1nae7qam --config-name recipe.yaml hydra.job_logging.root.level=50 +command=evaluate
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Executing AF command: af --config-dir /tmp/tmpbh979imi --config-name recipe.yaml hydra.job_logging.root.level=50 +command=evaluate
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Executing AF command: af --config-dir /tmp/tmpb4ieg0kx --config-name recipe.yaml hydra.job_logging.root.level=50 +command=evaluate
/home/sai/miniconda3/envs/latest_dev/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Recipe with ID 122826 has a Mean Average Precision score (averaged over IoU Thresholds 0.50:0.95:0.05) of 0.49281468987464905
Recipe with ID 120659 has a Mean Average Precision score (averaged over IoU Thresholds 0.50:0.95:0.05) of 0.6422910094261169
Recipe with ID 122558 has a Mean Average Precision score (averaged over IoU Thresholds 0.50:0.95:0.05) of 0.6373014450073242
Recipe with ID 119323 has a Mean Average Precision score (averaged over IoU Thresholds 0.50:0.95:0.05) of 0.5658560395240784
Recipe with ID 119963 has a Mean Average Precision score (averaged over IoU Thresholds 0.50:0.95:0.05) of 0.613910973072052
The recipe with ID 120659 achieved the highest mAP, so let's select it.
selected_recipe = candidate_recipes[120659]
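Equivalently, you can pick the winner programmatically from the evaluation outputs instead of hardcoding the ID, using the same evaluate.metric_single key printed above:
# Choose the recipe with the highest mAP reported by the evaluate task
best_id = max(evaluate_outputs, key=lambda rid: evaluate_outputs[rid]["evaluate.metric_single"])
selected_recipe = candidate_recipes[best_id]
print("Selected recipe:", best_id)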
From here, you can follow the steps shown in the Getting Started tutorial to optimize and deploy your model!