(v2.6.0) BYOD: Training With Your Own Data
If you would like to add your own data to the mix, LEIP Recipes supports easy ingestion of two commonly used data formats: MS COCO and Pascal VOC. Adding your data to a recipe in one of these formats is simply a matter of verifying that certain conventions are followed and modifying one of the configuration files to point to the associated components.
Once your data has been provided, the modular nature of LEIP Recipes means that your dataset will remain compatible with future recipes for models of the same type as they are added. This gives you a simple, reproducible path for trying out various model sizes and architectures and identifying the best model for your application.
Bring Your Own COCO Formatted Data
Let's begin with the MS COCO format ingestor. Find the MS COCO-like dataset template file in the docker container, located at /latentai/custom-configs/data/coco-like-template.yaml. It will look like this:
nclasses: 0 # number of classes in your detection dataset
module:
  _target_: af.core.data.modules.adaptermodule.AdaptorDataModule
  batch_sizes: ${task.batch_sizes}
  num_workers: ${task.num_workers}
  adaptors: ${model.adaptors}
  train_transforms:
    _target_: af.core.data.augmentations.basic.resize
    width: ${task.width}
    height: ${task.height}
  valid_transforms:
    _target_: af.core.data.augmentations.basic.resize
    width: ${task.width}
    height: ${task.height}
  dataset_generator:
    _target_: af.core.data.generic.coco-like.COCOlike
    root_path: notset # absolute path to your dataset
    download_from_url: null
    task_family: detection
    train_images_dir: train
    val_images_dir: validation
    train_annotations_json: annotations/instances_train.json
    val_annotations_json: annotations/instances_val.json
    label_indexing: 0-indexed-no-background
    dataset_name: my-custom-coco-like-data
The fields you will need to change are nclasses and the fields under dataset_generator. Updating them with your dataset's information is enough to load and use your COCO-formatted dataset (a minimal example annotations file is shown after the field list):
- nclasses: number of classes in your dataset
- root_path: absolute path to the root directory of the dataset
- download_from_url: if your data is compressed into a single file and resides in the cloud (public S3 bucket, Drive, etc.), the ingestor can download it and place it inside root_path
- task_family: detection or segmentation (COCO annotations support both)
- train_images_dir: path to the folder containing only training images, relative to root_path
- val_images_dir: path to the folder containing only validation images, relative to root_path
- train_annotations_json: path to the .json file of training annotations, relative to root_path
- val_annotations_json: path to the .json file of validation annotations, relative to root_path
- label_indexing: are the labels 0-indexed? Is there a background class? One of 0-indexed-no-background, 1-indexed-no-background, or 0-indexed-with-background. Depending on your data's label_indexing, the AF may apply a label shift to ensure that if a background class is present it is at index 0, and the other classes start at index 1.
- dataset_name: a string. It will be used to name any generated artifacts.
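For reference, a COCO-format annotations file (e.g. annotations/instances_train.json) is a single JSON document with images, annotations, and categories sections. Here is a minimal illustrative sketch with 0-indexed labels and no background class; all file names, classes, and box values are made up:

{
  "images": [
    {"id": 0, "file_name": "image1.jpeg", "width": 1920, "height": 1080}
  ],
  "annotations": [
    {"id": 0, "image_id": 0, "category_id": 0, "bbox": [100, 200, 50, 80], "area": 4000, "iscrowd": 0}
  ],
  "categories": [
    {"id": 0, "name": "car"},
    {"id": 1, "name": "pedestrian"}
  ]
}

Note that COCO bounding boxes are [x, y, width, height] in pixels, and that with 0-indexed-no-background the category ids start at 0.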
By default, everything except the root_path is set to match the COCO dataset defaults. The folder structure is expected to be:
path/to/mydataset/
|---train
    |---image1.jpeg
    |---image2.jpeg
    ...
|---validation
    |---image8.jpeg
    |---image9.jpeg
    ...
|---annotations
    |---instances_train.json
    |---instances_val.json
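If you are unsure what value to give nclasses, a quick way to check (assuming the jq utility is available in your environment) is to count the entries in the categories section of your annotations file:

jq '.categories | length' path/to/mydataset/annotations/instances_train.json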
Now that you have customized the YAML file for your dataset, you can use it as a component by passing its name to the data parameter, like this:
af --config-name=yolov5_L_RT data=coco-like-template 'hydra.searchpath=[file://custom-configs]' command=train task.moniker="BYOD_recipe"
A Concrete BYOD COCO-like Example
Let’s look at the SODA10M dataset as an example. Without downloading it, we can see that the folder structure does not exactly match the LEIP Recipe defaults:
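Based on the configuration values we will use below, the extracted files are laid out roughly like this:

soda10m/
|---SSLAD-2D
    |---Labeled
        |---train
        |---val
        |---annotations
            |---instance_train.json
            |---instance_val.json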
Let's assume you downloaded and extracted the files to /latentai/soda10m.
Notice the slight difference in the annotation files: they use instance_train.json instead of the plural instances_train.json.
A YAML file needs to be created for this dataset. Let's make a new file at custom-configs/data/soda10m_config.yaml and copy in the contents of the template to start. Then, let's modify the fields in custom-configs/data/soda10m_config.yaml to match the needs of this dataset. The finished soda10m_config.yaml will look like this:
nclasses: 6
module:
  _target_: af.core.data.modules.adaptermodule.AdaptorDataModule
  batch_sizes: ${task.batch_sizes}
  num_workers: ${task.num_workers}
  adaptors: ${model.adaptors}
  train_transforms:
    _target_: af.core.data.augmentations.basic.resize
    width: ${task.width}
    height: ${task.height}
  valid_transforms:
    _target_: af.core.data.augmentations.basic.resize
    width: ${task.width}
    height: ${task.height}
  dataset_generator:
    _target_: af.core.data.generic.coco-like.COCOlike
    root_path: /latentai/soda10m/SSLAD-2D/Labeled
    download_from_url: null
    task_family: "detection"
    val_images_dir: "val"
    train_images_dir: "train"
    train_annotations_json: "annotations/instance_train.json"
    val_annotations_json: "annotations/instance_val.json"
    label_indexing: 0-indexed-no-background
    dataset_name: soda10m
You can now train on this dataset by running the following:
af --config-name=yolov5_L_RT data=soda10m_config 'hydra.searchpath=[file://custom-configs]' \
command=train task.moniker="BYOD_recipe"
- --config-name=yolov5_L_RT: selects the yolov5_L_RT recipe
- data=soda10m_config: uses the data/soda10m_config.yaml file as the data config for this recipe
- 'hydra.searchpath=[file://custom-configs]': adds the path custom-configs to the search path used when locating the .yaml config files. It is important to ensure the .yaml is nested inside an additional data folder in the added search path (see the layout sketch after this list).
- task.moniker="BYOD_recipe": gives this run a different name, so we don't confuse its artifacts with those of the pretrained recipe.
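For reference, the layout expected for the added search path looks like this:

/latentai/custom-configs/
|---data
    |---coco-like-template.yaml
    |---soda10m_config.yaml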
Your trained checkpoint will be stored at the location specified in the logs. It will be inside /latentai/outputs/{date}_BYOD_recipe/checkpoints/*.
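For example, you can list the generated checkpoints as follows (the dated folder name will vary per run):

ls /latentai/outputs/*_BYOD_recipe/checkpoints/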
Bring Your Own Pascal VOC Formatted Data
Training with a Pascal VOC formatted dataset is very similar to using COCO formatted data. Perform the following to train your model with Pascal VOC formatted data:
Find the Pascal-like dataset template file in the docker container located at /latentai/custom-configs/data/pascal-like-template.yaml. It will look like this:
nclasses: 0
module:
  _target_: af.core.data.modules.adaptermodule.AdaptorDataModule
  batch_sizes: ${task.batch_sizes}
  num_workers: ${task.num_workers}
  adaptors: ${model.adaptors}
  train_transforms:
    _target_: af.core.data.augmentations.basic.very_light
    width: ${task.width}
    height: ${task.height}
  valid_transforms:
    _target_: af.core.data.augmentations.basic.resize_norm_imagenet
    width: ${task.width}
    height: ${task.height}
  dataset_generator:
    _target_: af.core.data.generic.pascal-like.PASCALlike
    root_path: notset # when downloading a dataset, a good default is to use ${paths.cache_dir}
    images_dir: JPEGImages
    annotations_dir: Annotations
    type: detection
    is_split: false
    trainval_split_ratio: 0.75
    trainval_split_seed: 42
    train_set: ImageSets/Main/train.txt
    val_set: ImageSets/Main/val.txt
    labelmap_file: pascal_label_map.pbtxt
    download_from_url: false
    dataset_name: my-custom-pascal-like-data
The fields you will need to change are nclasses and the fields under dataset_generator. Updating them with your dataset's information is enough to load and use your Pascal VOC-formatted dataset (an example folder layout and label map are shown after the field list):
- nclasses: number of classes in your dataset
- root_path: absolute path to the root directory of the dataset
- images_dir: path to the folder containing only images, relative to root_path
- annotations_dir: path to the folder containing the XML annotation files, relative to root_path
- type: detection or segmentation. For this detection recipe, we can leave the value as detection.
- is_split: true or false. If set to true, the lists of training and validation samples must be specified using train_set and val_set. If set to false, the data will be split by the ingestor using trainval_split_ratio and trainval_split_seed.
- trainval_split_ratio: ratio used to split the dataset. Used only if is_split: false
- trainval_split_seed: seed used to pseudo-randomly split the dataset. Used only if is_split: false
- train_set: path to a text file containing the names (without extensions) of the training samples. Used only if is_split: true
- val_set: path to a text file containing the names (without extensions) of the validation samples. Used only if is_split: true
- labelmap_file: (optional) path to a file containing a map from class index (int) to class name (string), relative to root_path. If you don't have this file, it will be created automatically and stored in the specified location.
- download_from_url: if your data is compressed into a single file and resides in the cloud (public S3 bucket, Drive, etc.), the ingestor can download it and place it inside root_path
- label_indexing: are the labels 0-indexed? Is there a background class? One of 0-indexed-no-background, 1-indexed-no-background, or 0-indexed-with-background. Depending on your data's label_indexing, the AF may apply a label shift to ensure that if a background class is present it is at index 0, and the other classes start at index 1.
- dataset_name: a string. It will be used to name any generated artifacts.
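For reference, here is the folder layout the template defaults expect, by analogy with the COCO example above (file names are illustrative):

path/to/mydataset/
|---JPEGImages
    |---image1.jpeg
    |---image2.jpeg
    ...
|---Annotations
    |---image1.xml
    |---image2.xml
    ...
|---ImageSets
    |---Main
        |---train.txt
        |---val.txt
|---pascal_label_map.pbtxt

The ImageSets lists are only needed if is_split: true. If you provide a labelmap_file, a common convention for it is the pbtxt format; this is a sketch with made-up class names, not a requirement of the ingestor:

item {
  id: 1
  name: 'car'
}
item {
  id: 2
  name: 'person'
}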
A Concrete BYOD Pascal VOC Format Example
For an example of the usage with a real dataset, see custom-configs/data/smoke-pascal-like.yaml. You can run the example by:
af --config-name=yolov5_L_RT data=smoke-pascal-like 'hydra.searchpath=[file://custom-configs]' \
command=train task.moniker="BYOD_recipe"
This example has a download_from_url defined with the path to the data, so the data will be downloaded automatically into root_path.
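For your own remotely hosted dataset, the relevant fields would look something like the following sketch; the URL here is hypothetical, and ${paths.cache_dir} is the default the template suggests when downloading:

dataset_generator:
  _target_: af.core.data.generic.pascal-like.PASCALlike
  root_path: ${paths.cache_dir} # the ingestor downloads the archive and places it inside root_path
  download_from_url: https://my-bucket.s3.amazonaws.com/my-pascal-dataset.tar.gz
  # ...remaining fields as in the template...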
Next, let's evaluate the BYOD model on host and export it to use with the SDK.