If you would like to add your own data to the mix, LEIP Recipes supports easy ingestion of two commonly used data formats: MS COCO and Pascal VOC. Adding your data to LEIP Recipes in one of these formats is simply a matter of verifying that certain conventions are followed and modifying one of the configuration files to point to the associated components.

Once your data has been provided, the modular nature of LEIP Recipes means that your dataset will be compatible with future recipes for models of the same type as they are added. This gives you a simple, reproducible path for trying out various model sizes and architectures and identifying the best model for your needs.

Bring Your Own COCO Formatted Data

Let's begin with the MS COCO format ingestor. Find the MS COCO-like dataset template file in the docker container located at /latentai/custom-dataset-configs/coco-like-template.yaml

It will look like this:

nclasses: 0 # number of classes in your detection dataset
module:
  _target_: af.core.data.modules.adaptermodule.AdaptorDataModule
  batch_sizes: ${task.batch_sizes}
  num_workers: ${task.num_workers}
  adaptors: ${model.adaptors}
  train_transforms:
    _target_: af.core.data.augmentations.basic.resize
    width: ${task.width}
    height: ${task.height}
  valid_transforms:
    _target_: af.core.data.augmentations.basic.resize
    width: ${task.width}
    height: ${task.height}
  dataset_generator:
      _target_: af.core.data.generic.coco-like.COCOlike
      root_path: notset # full path to your dataset
      download_from_url: None
      task_family: detection
      train_images_dir: train 
      val_images_dir: validation 
      train_annotations_dir: annotations/instances_train.json 
      val_annotations_dir: annotations/instances_val.json 

The fields you will need to change are nclasses (the number of classes in your dataset) and the entries under dataset_generator, which begin at line 15 of the file. Updating those fields with your dataset’s information is enough to load and use your COCO-formatted dataset.

  • root_path: full path to the root directory of the dataset

  • download_from_url: if your data is compressed to a single file, and resides in the cloud (public S3 bucket, Drive, etc.) the ingestor can download it and place it inside root_path

  • task_family: detection or segmentation (COCO annotations support both)

  • train_images_dir: path to folder containing only training images, relative to root_path

  • val_images_dir: path to folder containing only validation images, relative to root_path

  • train_annotations_dir: path to .json file of training annotations, relative to root_path

  • val_annotations_dir: path to .json file of validation annotations, relative to root_path

  • nclasses: number of classes in your dataset
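
If you are unsure what value to use for nclasses, one option is to read it from the annotation file itself. The following is a minimal sketch, assuming your file follows the standard COCO layout with a top-level "categories" list and that nclasses should equal the number of entries in that list. The function name count_coco_classes is hypothetical and not part of LEIP Recipes.

```python
import json
from pathlib import Path


def count_coco_classes(annotation_file):
    """Return (number of categories, sorted category names) for a COCO-style JSON file.

    Assumption: the file has a top-level "categories" list whose entries
    each carry a "name" key, as in the standard COCO annotation format.
    """
    with Path(annotation_file).open("r", encoding="utf-8") as fh:
        coco = json.load(fh)
    categories = coco.get("categories", [])
    names = sorted(cat["name"] for cat in categories)
    return len(categories), names
```

For example, calling count_coco_classes("path/to/mydataset/annotations/instances_train.json") returns the category count and names, which you can compare against the value you plan to put in nclasses.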

By default, everything except the root_path is set to match the COCO dataset defaults. The folder structure is expected to be:

   path/to/mydataset/                
                    |---train
                        |---image1.jpeg
                        |---image2.jpeg
                            ...
                    |---validation
                        |---image8.jpeg
                        |---image9.jpeg
                            ...
                    |---annotations
                        |---instances_train.json
                        |---instances_val.json
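
Before editing the YAML file, it can help to confirm that your dataset actually matches the expected layout. The sketch below checks the two image folders and the two annotation files relative to root_path. The function name check_coco_like_layout is hypothetical, and its parameters simply mirror the dataset_generator field names and template defaults shown above. It is not part of the LEIP Recipes tooling.

```python
from pathlib import Path


def check_coco_like_layout(
    root_path,
    train_images_dir="train",
    val_images_dir="validation",
    train_annotations_dir="annotations/instances_train.json",
    val_annotations_dir="annotations/instances_val.json",
):
    """Return a list of problems found; an empty list means the layout looks fine."""
    root = Path(root_path)
    problems = []

    # Image folders must exist and contain at least one entry.
    for key, rel in (("train_images_dir", train_images_dir),
                     ("val_images_dir", val_images_dir)):
        folder = root / rel
        if not folder.is_dir():
            problems.append(f"{key}: missing folder {folder}")
        elif not any(folder.iterdir()):
            problems.append(f"{key}: folder {folder} is empty")

    # Annotation entries point at .json files, not directories.
    for key, rel in (("train_annotations_dir", train_annotations_dir),
                     ("val_annotations_dir", val_annotations_dir)):
        path = root / rel
        if not path.is_file():
            problems.append(f"{key}: missing file {path}")

    return problems
```

Calling check_coco_like_layout("path/to/mydataset") with no other arguments tests against the template defaults. Pass your own relative paths if your dataset uses different names.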

Now that you have customized the YAML file for your dataset, you can use it as a component by passing its name to the data parameter, like this:

af --config-name=yolov5_L_RT data=coco-like-template 'hydra.searchpath=[file://custom-configs]' command=train task.moniker="BYOD_recipe"

A Concrete BYOD COCO-like Example

Let’s look at the SODA10M dataset as an example. Its folder structure does not exactly match the LEIP Recipes defaults.

Let's assume you downloaded and extracted the files to /home/data/soda10m.

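Notice the slight difference in the annotation files: they are named instance_train.json and instance_val.json rather than the plural instances_train.json and instances_val.json. The validation images also live in a folder named val rather than validation.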

A YAML file needs to be made for this dataset. Let's make a new file in custom-configs/data/soda10m_config.yaml and copy the contents from our template to start.

Then, let's modify the fields in custom-configs/data/soda10m_config.yaml to match the needs of this dataset.

The soda10m_config.yaml will look like this:

nclasses: 6
module:
  _target_: af.core.data.modules.adaptermodule.AdaptorDataModule
  batch_sizes: ${task.batch_sizes}
  num_workers: ${task.num_workers}
  adaptors: ${model.adaptors}
  train_transforms:
    _target_: af.core.data.augmentations.basic.resize
    width: ${task.width}
    height: ${task.height}
  valid_transforms:
    _target_: af.core.data.augmentations.basic.resize
    width: ${task.width}
    height: ${task.height}
  dataset_generator:
      _target_: af.core.data.generic.coco-like.COCOlike
      root_path: /home/data/soda10m/SSLAD-2D/Labeled
      task_family: "detection"
      val_images_dir: "val"
      train_images_dir: "train"
      train_annotations_dir: "annotations/instance_train.json"
      val_annotations_dir: "annotations/instance_val.json"

You can now train on this dataset by running the following:

af --config-name=yolov5_L_RT data=soda10m_config 'hydra.searchpath=[file://custom-configs]' \
 command=train task.moniker="BYOD_recipe"
  • --config-name=yolov5_L_RT : select the yolov5_L_RT recipe

  • data=soda10m_config : use the data/soda10m_config.yaml file as the data config for this recipe

  • 'hydra.searchpath=[file://custom-configs]' : add custom-configs to the search path used to locate the .yaml file

    • It's important to ensure the .yaml file is nested inside an additional data folder within the added search path.

  • task.moniker="BYOD_recipe" : give this run a different name, so we don't confuse its artifacts with those of the pretrained model

Your trained checkpoint will be stored at the location specified in the logs. It will be inside /latentai/outputs/{date}_BYOD_recipe/checkpoints/*.
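
If you have run training several times, a small helper can locate the newest checkpoint for a given moniker. This is a minimal sketch that assumes only the directory pattern described above ({date}_BYOD_recipe/checkpoints/ under /latentai/outputs). The function name latest_checkpoint is hypothetical, and "newest" is decided by file modification time rather than by parsing the date in the folder name.

```python
from pathlib import Path


def latest_checkpoint(outputs_dir="/latentai/outputs", moniker="BYOD_recipe"):
    """Return the most recently modified checkpoint file for a moniker, or None.

    Assumption: run folders are named <date>_<moniker> and hold their
    checkpoint files directly inside a "checkpoints" subfolder.
    """
    pattern = f"*_{moniker}/checkpoints/*"
    candidates = [p for p in Path(outputs_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The logs remain the authoritative source for the checkpoint location. Treat this helper as a convenience for scripting.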

Bring Your Own Pascal VOC Formatted Data

Training with a Pascal VOC formatted dataset is very similar to training with COCO formatted data. Perform the following to train your model with Pascal VOC formatted data:

Find the Pascal-like dataset template file in the docker container located at

/latentai/custom-dataset-configs/data/pascal-like-template.yaml

It will look like this:

nclasses: 0
module:
  _target_: af.core.data.modules.adaptermodule.AdaptorDataModule
  batch_sizes: ${task.batch_sizes}
  num_workers: ${task.num_workers}
  adaptors: ${model.adaptors}
  train_transforms:
    _target_: af.core.data.augmentations.basic.very_light
    width: ${task.width}
    height: ${task.height}
  valid_transforms:
    _target_: af.core.data.augmentations.basic.resize_norm_imagenet
    width: ${task.width}
    height: ${task.height}
  dataset_generator:
    _target_: af.core.data.generic.pascal-like.PASCALlike
    root_path: notset # when downloading a dataset, a good default is to use ${paths.cache_dir}
    images_dir: JPEGImages
    annotations_dir: Annotations
    type: detection
    is_split: false
    trainval_split_ratio: 0.75
    trainval_split_seed: 42
    train_set: ImageSets/Main/train.txt
    val_set: ImageSets/Main/val.txt
    labelmap_file: pascal_label_map.pbtxt
    download_from_url: false
    dataset_name: my-custom-pascal-like-data

The fields you will need to change are nclasses (the number of classes in your dataset) and the entries under dataset_generator, which begin at line 15 of the file. Updating those fields with your dataset’s information is enough to load and use your Pascal VOC formatted dataset.

  • root_path: full path to the root directory of the dataset

  • images_dir: path to folder containing only images, relative to root_path

  • annotations_dir: path to folder containing xml files, relative to root_path

  • type: detection or segmentation. For this detection recipe, we can leave the value as detection

  • is_split: true or false.

    • If set to true, lists with samples for training and validation should be specified using train_set and val_set

    • If set to false, data will be split by the ingestor given the trainval_split_ratio and trainval_split_seed

  • trainval_split_ratio: ratio to use to split the dataset. Used only if is_split: false

  • trainval_split_seed: seed used to pseudo-randomly split the dataset. Used only if is_split: false

  • train_set: path to a text file containing the names (no extensions) of the training samples. Used only if is_split: true

  • val_set: path to a text file containing the names (no extensions) of the validation samples. Used only if is_split: true

  • labelmap_file: (optional) path to a file containing a map from class index (int) to class name (string), relative to root_path. If you don't have this file, it will be created automatically and stored in the specified directory.

  • download_from_url: if your data is compressed to a single file, and resides in the cloud (public S3 bucket, Drive, etc.) the ingestor can download it and place it inside root_path

  • nclasses: number of classes in your dataset
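
If you prefer to control the split yourself (is_split: true), you need to supply the train_set and val_set files. The sketch below writes them from the contents of images_dir using a seeded shuffle. It illustrates the idea of a reproducible ratio-based split and is not the ingestor's own algorithm. The function name write_split_files is hypothetical, and treating the ratio as the fraction of samples assigned to training is an assumption.

```python
import random
from pathlib import Path


def write_split_files(
    root_path,
    images_dir="JPEGImages",
    train_set="ImageSets/Main/train.txt",
    val_set="ImageSets/Main/val.txt",
    ratio=0.75,  # assumption: fraction of samples that go to training
    seed=42,
):
    """Write train/val name lists (no extensions) and return them as two lists."""
    root = Path(root_path)

    # Sort first so the result does not depend on directory listing order.
    names = sorted(p.stem for p in (root / images_dir).iterdir() if p.is_file())

    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(names)

    n_train = int(len(names) * ratio)
    train, val = names[:n_train], names[n_train:]

    for rel, subset in ((train_set, train), (val_set, val)):
        out = root / rel
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text("\n".join(subset) + "\n", encoding="utf-8")

    return train, val
```

Running it twice with the same seed on the same images produces the same two lists, which keeps experiments comparable across recipes.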

A Concrete BYOD Pascal VOC Format Example

For an example with a real dataset, see custom-configs/data/smoke-pascal-like.yaml. You can train on the example data by running:

af --config-name=yolov5_L_RT data=smoke-pascal-like 'hydra.searchpath=[file://custom-configs]' \
 command=train task.moniker="BYOD_recipe"

This example has a download_from_url defined with the path to the data, so the data will get downloaded automatically into root_path.


Next, let's evaluate the BYOD model on host and export it to use with the SDK.