CUDA Graphs with LatentRuntimeEngine

CUDA Graphs allow sequences of GPU operations to be recorded and executed as a single graph, significantly reducing kernel launch overhead and improving performance. This guide explains how to effectively use CUDA Graphs with LatentRuntimeEngine for optimized inference.

What are CUDA Graphs?

CUDA Graphs capture an entire sequence of GPU operations as a single computational graph that can be replayed repeatedly with minimal CPU overhead. Benefits include:

  • Reduced launch overhead: Eliminates per-kernel launch costs
  • Improved performance: Faster execution through graph optimization
  • Consistent timing: More predictable inference latency
  • Memory efficiency: Pre-allocated buffers reduce allocation overhead
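
LatentRuntimeEngine manages graph capture internally when `enable_cuda_graph=True`, but the mechanics can be illustrated with PyTorch's native CUDA Graphs API. The sketch below is illustrative only (the `capture_and_replay` helper is not a pylre API): it warms up the model, captures one forward pass, then replays the graph with fresh data copied into the captured buffer. The static buffers are also why CUDA Graphs require fixed input shapes.

```python
# Illustrative only: roughly what a runtime does internally when
# CUDA Graphs are enabled, shown with PyTorch's native API.
import torch

def capture_and_replay(model, example_input, n_replays=3):
    """Capture one forward pass into a CUDA graph, then replay it."""
    # Warm up on a side stream so capture sees steady-state allocations.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            model(example_input)
    torch.cuda.current_stream().wait_stream(s)

    # Static buffers: every replay reads and writes these same addresses,
    # which is why CUDA Graphs require fixed input shapes.
    static_in = example_input.clone()
    g = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(g):
        static_out = model(static_in)

    for _ in range(n_replays):
        static_in.copy_(example_input)  # stage new data into the captured buffer
        g.replay()                      # re-launch the whole graph in one call
    return static_out
```

Replaying the graph issues all captured kernels with a single launch, which is where the per-kernel launch savings come from.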

When to Use CUDA Graphs

CUDA Graphs are ideal for scenarios with:

  1. Fixed input shapes: Models with consistent input dimensions
  2. Repeated inference: Multiple inferences with the same model
  3. Performance-critical applications: Real-time or high-throughput requirements
  4. Batch processing: Fixed batch size operations
  5. Production deployments: Optimized inference pipelines
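
The checklist above can be condensed into a simple decision rule. The helper below is hypothetical (not part of pylre); it also encodes the constraint, covered later in this guide, that CUDA Graphs cannot be combined with multi-stream parallelism.

```python
# Hypothetical helper (not a pylre API) encoding the checklist above.
def should_enable_cuda_graph(fixed_input_shape: bool,
                             repeated_inference: bool,
                             parallel_models: bool = False) -> bool:
    """CUDA Graphs pay off for fixed shapes and repeated inference,
    but cannot be combined with multi-stream parallel execution."""
    if parallel_models:  # graphs and streams are mutually exclusive
        return False
    return fixed_input_shape and repeated_inference
```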

Basic Usage

Enabling CUDA Graphs

from pylre import LatentRuntimeEngine as LRE

# Enable CUDA Graphs during initialization
lre = LRE(
    "model.onnx",
    execution_provider="tensorrt",
    enable_cuda_graph=True,
)

Complete Example

import torch
import numpy as np
from pylre import LatentRuntimeEngine as LRE
import time

# Initialize LRE with CUDA Graphs enabled
lre = LRE(
    "model.onnx",
    execution_provider="tensorrt",
    enable_cuda_graph=True,
)

# Prepare fixed-size input
batch_size = 1
height, width = 640, 640
channels = 3

# Create sample input tensor
input_tensor = torch.randn(batch_size, channels, height, width, device="cuda")

print("Warming up CUDA Graph...")
# First inference captures the graph
output = lre(input_tensor)
print(f"Graph captured. Output shape: {output.shape}")

# Benchmark performance
num_iterations = 1000
torch.cuda.synchronize()

start_time = time.time()
for _ in range(num_iterations):
    output = lre(input_tensor)

torch.cuda.synchronize()
end_time = time.time()

avg_time = (end_time - start_time) / num_iterations * 1000  # Convert to ms
fps = 1000 / avg_time

print(f"Average inference time: {avg_time:.2f} ms")
print(f"Throughput: {fps:.2f} FPS")
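
The timing loop above reports only an average, which can hide latency outliers. A sketch of a more robust harness (the `benchmark` helper is hypothetical, standard library only) that also reports percentile latencies; for GPU work, the callable you pass in should synchronize internally (e.g. call `torch.cuda.synchronize()`) so timings reflect completed kernels.

```python
import statistics
import time

def benchmark(fn, n=1000, warmup=10):
    """Time fn() n times and report mean / p50 / p95 latency in ms.
    For GPU inference, fn should synchronize before returning."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return {
        "mean_ms": statistics.fmean(samples),
        "p50_ms": samples[len(samples) // 2],
        "p95_ms": samples[int(len(samples) * 0.95)],
    }
```

With CUDA Graphs enabled, p95 should sit close to p50, reflecting the more predictable latency noted above.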

Memory Management

Automatic Memory Allocation

CUDA Graphs in LatentRuntimeEngine automatically handle memory management:

# Memory is pre-allocated during graph capture
lre = LRE("model.onnx", enable_cuda_graph=True)

# Input data is copied to pre-allocated buffers
input_data = torch.randn(1, 3, 224, 224, device="cuda")
output = lre(input_data)  # Uses pre-allocated memory

# Check memory usage
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")

Memory Optimization Tips

# Monitor peak memory usage
torch.cuda.reset_peak_memory_stats()
output = lre(input_tensor)
peak_memory = torch.cuda.max_memory_allocated() / 1024**2
print(f"Peak memory usage: {peak_memory:.2f} MB")

# Clear unused memory if needed
torch.cuda.empty_cache()
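
The monitoring pattern above can be wrapped into a small context manager so each inference site reports its own peak usage. `track_peak_memory` is a hypothetical helper, not a pylre API; it is a no-op on machines without CUDA.

```python
import contextlib

import torch

@contextlib.contextmanager
def track_peak_memory(label="inference"):
    """Print peak GPU memory used inside the block (no-op without CUDA)."""
    if not torch.cuda.is_available():
        yield
        return
    torch.cuda.reset_peak_memory_stats()
    yield
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"{label}: peak GPU memory {peak_mb:.2f} MB")

# Usage:
# with track_peak_memory("graph inference"):
#     output = lre(input_tensor)
```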

CUDA Graphs vs CUDA Streams

Important: CUDA Graphs and CUDA Streams are mutually exclusive features in LatentRuntimeEngine:

  • CUDA Graphs: Use when you have fixed input shapes and want maximum performance for repeated inference
  • CUDA Streams: Use when you need parallel execution of multiple models or dynamic input shapes

# Choose one approach based on your use case

# Option 1: CUDA Graphs for fixed shapes and maximum performance
lre_graph = LRE("model.onnx", enable_cuda_graph=True)

# Option 2: CUDA Streams for parallel execution
stream = torch.cuda.Stream()
lre_stream = LRE("model.onnx", cuda_stream=stream.cuda_stream)

# Cannot combine both features
# This will NOT work:
# lre = LRE("model.onnx", enable_cuda_graph=True, cuda_stream=stream.cuda_stream)

Performance Expectations

Typical performance improvements with CUDA Graphs:

  • Small models: 10-30% improvement
  • Large models: 5-15% improvement
  • Batch inference: Higher improvements with larger batches
  • Repeated inference: Maximum benefit for repeated operations
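
As a quick worked example of reading these numbers (the `speedup_percent` helper is hypothetical): a small model going from 2.0 ms to 1.5 ms per inference is a 25% latency reduction, inside the 10-30% band above.

```python
def speedup_percent(baseline_ms: float, graph_ms: float) -> float:
    """Percentage latency reduction from enabling CUDA Graphs."""
    return (baseline_ms - graph_ms) / baseline_ms * 100.0

# speedup_percent(2.0, 1.5) -> 25.0
```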