Advanced Application Framework Options

The Latent AI Machine Learning Application Framework (AF) is a modular framework that enables users to solve machine learning problems by bringing their own data and using those datasets to quickly train and evaluate different models to select the best performing model that meets their design requirements. Models exported from AF can be optimized, compiled and evaluated on target edge hardware to verify criteria are met.

When used with LEIP Recipes, AF provides default configurations designed to perform well across a broad set of applications. Depending on your dataset, you may need to change the defaults, such as the input shapes. You may also wish to alter parameters to explore different learning rates, or to trade accuracy for faster training or vice versa.

AF builds on many component technologies, including Hydra and PyTorch Lightning, giving users configurable access to many underlying components.

We recommend starting with LEIP Recipes to gain experience with AF and with a number of available models that not only work with AF out of the box but are also guaranteed to compile, optimize, and run on many different hardware platforms. If you would like to experiment with different parameters to find better settings for your dataset, the following sections introduce the underlying capabilities. If you have more specific needs, or different models you would like to use in this modular fashion, please contact us at Latent AI.

Basic commands

AF supports a set of commands/modes, and each is configurable with various aspects of the ML process:

  • train: Train a model

  • evaluate: Evaluate a trained model

  • predict: Visualize and summarize the predictions of a trained model

  • vizdata: Visualize the input data to verify that it is correctly ingested

  • export: Export a trained model for further processing in the LEIP SDK (i.e. compile, optimize).

Each mode can be set as shown here (evaluate mode as example):

CODE
af [...] command=evaluate

Historically, the default mode (i.e., if no command is specified) is train. This behavior is deprecated and will be removed in future versions of AF.
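
When AF is driven from a script, passing command= explicitly avoids relying on the deprecated default. The following is a minimal sketch, not part of AF: build_af_command is a hypothetical helper, and recipe_args stands in for the recipe-specific arguments written as [...] in this document.

```python
from typing import List, Sequence

# The five modes listed above.
AF_COMMANDS = ("train", "evaluate", "predict", "vizdata", "export")


def build_af_command(command: str, recipe_args: Sequence[str] = ()) -> List[str]:
    """Return an argv list for `af` that always sets command= explicitly.

    Hypothetical helper: `recipe_args` holds whatever recipe-specific
    arguments your setup needs (shown as [...] in this document).
    """
    if command not in AF_COMMANDS:
        raise ValueError(f"unknown AF command: {command!r}")
    return ["af", *recipe_args, f"command={command}"]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) to execute it where `af` is installed.
    print(" ".join(build_af_command("evaluate")))
```

Rejecting unknown modes up front keeps a typo from silently falling back to the default behavior.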

How do I?

Train a Model With My Data: BYOD

BYOD instructions differ depending on the type of recipe. Instructions are available for both Classifier and Detector models.

Specify a Trained Checkpoint for Further Processing (Evaluate, Visualize Predictions, Export)

The training process generates checkpoints of the best models as they are being trained. Checkpoints are written to the artifacts/ folder under the current working directory. For example, if your current folder is /latentai:

CODE
/latentai/artifacts/train/2022-06-15_13-53-38_task_leip_classifier/epoch=2-step=303.ckpt

Use the following syntax in order to specify such an existing checkpoint for continuing training, exporting, evaluating or visualizing:

CODE
af [...] +checkpoint=<absolute qualified path of .ckpt file>

The pathname has to be absolute and not relative to the current working directory.
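
Because the path must be absolute, a script that picks up the latest checkpoint needs to resolve it first. The sketch below assumes that checkpoints sit somewhere under artifacts/train, as in the example path above; newest_checkpoint is a hypothetical helper name, and "newest" here means most recently modified, which is not necessarily the best-scoring checkpoint.

```python
from pathlib import Path
from typing import Optional


def newest_checkpoint(train_dir: str = "artifacts/train") -> Optional[Path]:
    """Return the absolute path of the most recently modified .ckpt file.

    Assumes checkpoints live somewhere under `train_dir`. Returns None
    when the folder is missing or contains no checkpoint.
    """
    base = Path(train_dir)
    if not base.is_dir():
        return None
    candidates = list(base.rglob("*.ckpt"))
    if not candidates:
        return None
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    # resolve() turns a relative path into an absolute one.
    return newest.resolve()


if __name__ == "__main__":
    print(newest_checkpoint())
```

The returned path can then be supplied as the value of +checkpoint=.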

If you would like to change the checkpoint filename, for example to simplify scripting for automated test and integration, you can use the following option:

CODE
af [...] callbacks.checkpoint.filename=<checkpoint filename>

Export a Model

To export a pre-trained model (e.g., the YOLOv5 pre-trained on MSCOCO), call AF with the same configuration used for training and add command=export:

CODE
af [...] command=export 

If you do not provide a +checkpoint=</absolute/path/to.ckpt>, AF pulls in the pretrained weights of the model by default.

Export a Trained Model

To export a newly trained model, locate the checkpoint that you would like to export and call AF with the same configuration used for training:

CODE
af [...] command=export +checkpoint=<absolute qualified path of .ckpt file>

The default location for the exported .pt file will be:

CODE
./artifacts/export/[task.moniker]_[backbone]_[batch_size]x[height]x[width].pt

Example: leip_classifier_ptcv-mobilenetv2_w1_1x224x224.pt
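
For scripts that need to locate the exported file, the naming pattern above can be reproduced as follows. This is a sketch under assumptions: expected_export_path is a hypothetical helper, and the backbone string is passed exactly as it appears in the filename, since how AF derives that string from the configured backbone is not stated here.

```python
from pathlib import Path


def expected_export_path(
    moniker: str,
    backbone: str,
    batch_size: int,
    height: int,
    width: int,
    export_dir: str = "./artifacts/export",
) -> Path:
    """Build the default export path from the documented naming pattern:
    [task.moniker]_[backbone]_[batch_size]x[height]x[width].pt
    """
    filename = f"{moniker}_{backbone}_{batch_size}x{height}x{width}.pt"
    return Path(export_dir) / filename


if __name__ == "__main__":
    # Reproduces the example filename given above.
    print(expected_export_path("leip_classifier", "ptcv-mobilenetv2_w1", 1, 224, 224))
```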

Evaluate with a Trained Model Checkpoint

Locate the checkpoint that you would like to evaluate and call AF with the same configuration used for training, adding command=evaluate:

CODE
af [...] command=evaluate +checkpoint=<absolute qualified path of .ckpt file>

The AF will predict and run evaluation metrics over the entire validation set. At the end, you will see a metrics report, which will also be exported to /latentai/artifacts/evaluate/<data_name>/metrics_report.json.
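
The exported report is plain JSON, so it can be inspected or post-processed with a few lines of Python. The sketch below makes no assumption about the keys inside the report, because its schema is not described here; load_metrics_report is a hypothetical helper, and the <data_name> placeholder in the path must be replaced with your own value.

```python
import json
from pathlib import Path
from typing import Any


def load_metrics_report(report_path: str) -> Any:
    """Load a metrics_report.json file and return the parsed JSON content."""
    with Path(report_path).open("r", encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    import sys

    # Usage: python show_report.py /path/to/metrics_report.json
    report = load_metrics_report(sys.argv[1])
    # Pretty-print whatever structure the report contains.
    print(json.dumps(report, indent=2, sort_keys=True))
```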

Visualize Predictions with a Trained Model Checkpoint

Locate the checkpoint whose predictions you would like to visualize and call AF with the same configuration used for training, adding command=predict:

CODE
af [...] command=predict +checkpoint=<absolute qualified path of .ckpt file>

Change the Learning Rate

CODE
af [...] model.module.optimizer.lr=0.1

Change the Classifier

CODE
af [...] model.module.backbone=timm:visformer_small

Change the Processing Resolution

CODE
af [...] task.width=384 task.height=384

Change the Batch Size

CODE
af [...] task.bs_train=16 task.bs_val=64

Change the Optimizer

CODE
af [...] model.module.optimizer=timm.adamw

Add ML Metrics Logging -- Tensorboard

CODE
af [...] +loggers@loggers=tensorboard

The logs will be stored in the configured experiment output folder, ./outputs by default.

Add ML Metrics Logging -- Neptune.AI

CODE
af [...] +loggers@loggers=neptune loggers.neptune.project="<your_neptune_project_id>"

Note: You have to provide your Neptune credentials in the NEPTUNE_API_TOKEN environment variable; see https://docs.neptune.ai/getting-started/installation#authentication-neptune-api-token for details. The logs will be stored in your Neptune project.

Change the Learning Rate Scheduler

Change on command line (this is a group of values, so the syntax differs):

CODE
af [...] model/module/scheduler=OneCycle

Increase the Number of Training Epochs

CODE
af [...] trainer.max_epochs=42

Limit the Training Time

Limit to 2 hours and 42 minutes of total training time:

CODE
af [...] trainer.max_time="00:02:42:00"
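
The value uses the DD:HH:MM:SS layout that PyTorch Lightning accepts for its max_time setting, so the leading 00 is the number of days. Assuming AF passes trainer.max_time through to Lightning unchanged, a duration can be converted to this layout as sketched below; to_max_time is a hypothetical helper name.

```python
from datetime import timedelta


def to_max_time(limit: timedelta) -> str:
    """Format a duration as DD:HH:MM:SS for trainer.max_time."""
    total = int(limit.total_seconds())
    if total < 0:
        raise ValueError("the time limit must not be negative")
    days, rem = divmod(total, 86400)   # seconds per day
    hours, rem = divmod(rem, 3600)     # seconds per hour
    minutes, seconds = divmod(rem, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # 2 hours and 42 minutes, as in the example above.
    print(to_max_time(timedelta(hours=2, minutes=42)))  # 00:02:42:00
```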

Change the Display and Log Metrics

For the classifiers:

CODE
# add one or more metrics
+model/module/metrics@model.module.metrics=[AUROC,AveragePrecision]

# override to one or more metrics
model/module/metrics=[Accuracy,AUROC,AveragePrecision]

Train with Multiple GPUs on One Machine

CODE
# use all available gpus
af [...] trainer.devices=-1

# use first and third available gpus
af [...] trainer.devices=[0,2]

# use two gpus
af [...] trainer.devices=2

Get More Debug Output in the Console

CODE
af [...] hydra.verbose=[af]

When Does Training Stop?

Training generally stops when any of the following conditions is met:

  1. trainer.max_epochs is reached

  2. trainer.max_time is reached

  3. An early termination callback is enabled and its conditions are met, for example EarlyStopping based on val_loss_epoch.

  4. The user hits Ctrl-C
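
The four conditions above combine as a simple "first one that applies" check. The sketch below only illustrates that logic; it is not how AF or PyTorch Lightning implement it, and the names TrainingState and stop_reason are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingState:
    """Snapshot of a training run (illustrative only)."""
    epochs_done: int
    elapsed_seconds: float
    early_stop_triggered: bool = False
    interrupted: bool = False  # e.g. the user pressed Ctrl-C


def stop_reason(
    state: TrainingState,
    max_epochs: Optional[int] = None,
    max_time_seconds: Optional[float] = None,
) -> Optional[str]:
    """Return why training should stop, or None if it should continue."""
    if max_epochs is not None and state.epochs_done >= max_epochs:
        return "max_epochs reached"
    if max_time_seconds is not None and state.elapsed_seconds >= max_time_seconds:
        return "max_time reached"
    if state.early_stop_triggered:
        return "early termination callback"
    if state.interrupted:
        return "user interrupt"
    return None


if __name__ == "__main__":
    state = TrainingState(epochs_done=10, elapsed_seconds=9800.0)
    # 2 h 42 min = 9720 s, so the time limit applies before the epoch limit.
    print(stop_reason(state, max_epochs=42, max_time_seconds=9720.0))
```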


Available Options

Optimizer

How to Configure

Change on command line:

CODE
af [...] model.module.optimizer.moniker=timm.adamw

Change in YAML:

CODE
model:
  module:
    optimizer:
      moniker: timm.adamw
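
The command-line form and the YAML form address the same configuration node: each dot in the override key corresponds to one level of YAML nesting. The sketch below illustrates that mapping for simple key=value overrides only; it is not Hydra's parser, and override_to_nested is a hypothetical helper.

```python
from typing import Any, Dict


def override_to_nested(override: str) -> Dict[str, Any]:
    """Expand a dotted 'a.b.c=value' override into nested dictionaries."""
    key, sep, value = override.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {override!r}")
    nested: Any = value
    # Wrap the value from the innermost key outwards.
    for part in reversed(key.split(".")):
        nested = {part: nested}
    return nested


if __name__ == "__main__":
    print(override_to_nested("model.module.optimizer.moniker=timm.adamw"))
```

The resulting dictionary has the same shape as the YAML snippet above.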

Supported Values

CODE
torch:Adadelta
torch:Adagrad
torch:Adam
torch:AdamW
torch:SparseAdam
torch:Adamax
torch:ASGD
torch:LBFGS
torch:NAdam
torch:RAdam
torch:RMSprop
torch:Rprop
torch:SGD
timm:sgd
timm:nesterov
timm:momentum
timm:sgdp
timm:adam
timm:adamw
timm:adamp
timm:nadam
timm:radam
timm:adamax
timm:adabelief
timm:radabelief
timm:adadelta
timm:adagrad
timm:adafactor
timm:lamb
timm:lambc
timm:larc
timm:lars
timm:nlarc
timm:nlars
timm:madgrad
timm:madgradw
timm:novograd
timm:nvnovograd
timm:rmsprop
timm:rmsproptf

Schedulers

How to Configure

The schedulers can be configured either via the command line or by modifying the recipe YAML file directly.

Note: Internally, schedulers are groups of values, so the syntax for command-line and YAML file changes differs from the syntax for changing single values.

Change the scheduler on command line:

CODE
af [...] model/module/scheduler=OneCycle

Supported Values

CODE
ExponentialDecay
ExponentialDecayScaled
OneCycle
OneCycleAnnealed

model/module/scheduler=ExponentialDecay

Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ExponentialLR.html#exponentiallr

Parameter | Explanation
model.module.optimizer.lr | Starting LR (e.g., 0.01)
model.module.scheduler.actual.gamma | Decay rate (e.g., 0.95)
model.module.scheduler.actual.[XXX] | Any other parameter of the torch scheduler
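
With the referenced ExponentialLR scheduler, each scheduler step multiplies the learning rate by gamma, so after n steps the rate is lr * gamma^n. The sketch below computes that curve in plain Python, assuming one scheduler step per epoch (whether AF steps per epoch or per batch is not stated here); exponential_decay_lrs is a hypothetical helper.

```python
from typing import List


def exponential_decay_lrs(lr: float, gamma: float, epochs: int) -> List[float]:
    """Learning rate at the start of each epoch: lr * gamma ** epoch."""
    return [lr * gamma ** epoch for epoch in range(epochs)]


if __name__ == "__main__":
    # Example values from the parameter descriptions above.
    for epoch, value in enumerate(exponential_decay_lrs(0.01, 0.95, 5)):
        print(f"epoch {epoch}: lr = {value:.6f}")
```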

model/module/scheduler=ExponentialDecayScaled

Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ExponentialLR.html#exponentiallr

Parameter | Explanation
trainer.max_epochs | Max epoch for LR scaling (e.g. 42)
model.module.optimizer.lr | Starting LR (e.g. 0.1)
model.module.scheduler._recipe_.lr_end | Ending LR at last epoch (e.g. 0.0001)
model.module.scheduler.actual.[XXX] | Any other parameter of torch scheduler
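
One plausible reading of these parameters is that the decay rate is derived from the start and end learning rates, so that the exponential curve reaches lr_end after max_epochs decay steps. The formula AF actually uses is not given here, so the sketch below is an assumption, and scaled_gamma is a hypothetical helper.

```python
def scaled_gamma(lr: float, lr_end: float, max_epochs: int) -> float:
    """Decay rate such that lr * gamma ** max_epochs == lr_end.

    Assumption: one decay step per epoch and exactly `max_epochs` steps.
    """
    if lr <= 0 or lr_end <= 0:
        raise ValueError("learning rates must be positive")
    if max_epochs < 1:
        raise ValueError("max_epochs must be at least 1")
    return (lr_end / lr) ** (1.0 / max_epochs)


if __name__ == "__main__":
    # Example values from the parameter descriptions above.
    gamma = scaled_gamma(lr=0.1, lr_end=0.0001, max_epochs=42)
    print(f"gamma = {gamma:.4f}")
    print(f"lr after 42 epochs = {0.1 * gamma ** 42:.6f}")
```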

model/module/scheduler=OneCycle

Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html

Parameter | Explanation
trainer.max_epochs | Max epoch for LR scheduling (e.g. 42)
model.module.optimizer.lr | Max LR (e.g. 0.1)
model.module.scheduler.actual.[XXX] | Any parameter of torch scheduler
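
To see why the optimizer LR acts as the maximum here, it helps to look at the shape of the one-cycle curve: the rate warms up from a fraction of the maximum to the maximum, then anneals to a much smaller value. The sketch below is a hand-written approximation of the default two-phase cosine shape of torch's OneCycleLR (pct_start=0.3, div_factor=25, final_div_factor=1e4). It is for illustration only; use the torch scheduler for exact values. How AF derives the total number of steps from trainer.max_epochs is not stated here, so total_steps is an assumed input.

```python
import math
from typing import List


def one_cycle_lrs(
    max_lr: float,
    total_steps: int,
    pct_start: float = 0.3,
    div_factor: float = 25.0,
    final_div_factor: float = 1e4,
) -> List[float]:
    """Approximate per-step learning rates of a two-phase cosine one-cycle policy."""
    warmup_end = pct_start * total_steps - 1  # last step of the warm-up phase
    last_step = total_steps - 1
    if warmup_end <= 0 or last_step <= warmup_end:
        raise ValueError("total_steps is too small for the chosen pct_start")

    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor

    def cos_anneal(start: float, end: float, pct: float) -> float:
        # Cosine interpolation from `start` (pct=0) to `end` (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    lrs = []
    for step in range(total_steps):
        if step <= warmup_end:
            lrs.append(cos_anneal(initial_lr, max_lr, step / warmup_end))
        else:
            pct = (step - warmup_end) / (last_step - warmup_end)
            lrs.append(cos_anneal(max_lr, min_lr, pct))
    return lrs


if __name__ == "__main__":
    schedule = one_cycle_lrs(max_lr=0.1, total_steps=100)
    print(f"start={schedule[0]:.6f} peak={max(schedule):.6f} end={schedule[-1]:.8f}")
```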

model/module/scheduler=OneCycleAnnealed

Two phase LR: (1) OneCycle → (2) constant LR annealing

References:

Parameter | Explanation
model.module.scheduler._recipe_.epochs_onecycle | # of epochs for initial OneCycle
model.module.scheduler._recipe_.constant_lr_factor | factor of initial LR to anneal
model.module.optimizer.lr | Max LR (e.g. 0.1)
model.module.scheduler.actual.[XXX] | Any parameter of torch sequential scheduler
model.module.scheduler.actual._schedulers_[0].[XXX] | Any parameter of first phase OneCycle scheduler
model.module.scheduler.actual._schedulers_[1].[XXX] | Any parameter of second phase constant scheduler

model/module/scheduler=ReduceOnPlateau
model/module/scheduler=CosineAnnealing
