LLM GPU Memory Consumption Calculator

References & Credits

The GPU memory calculation strategies and the discussion of the challenges of training large language models (LLMs) are inspired by lessons from the Generative AI with Large Language Models course by deeplearning.ai. In particular, the sections "Optional video: Efficient multi-GPU compute strategies" and "Computational challenges of training LLMs" provided the key information for developing this calculator.

Calculation Method, Logic, and Training Overhead Explained

The calculator estimates the required GPU memory using two formulas:

  • Storage Memory: Parameters * BytesPerParameter
  • Total Training Memory: Storage Memory * (1 + Training Overhead Multiplier)

Where the training overhead multiplier accounts for the per-parameter memory needed beyond the weights themselves:

  • Two optimizer states, e.g. Adam's momentum and variance (roughly 2x the weight memory at matching precision)
  • Gradients (roughly 1x the weight memory at matching precision)
  • Activations and temporary variables (estimated at roughly 2x the weight memory)

Together these extra components add roughly 5x the model storage memory, so the training overhead multiplier is approximately 6: 1 for the model weights plus 5 for the additional components. The bytes per parameter depend on the chosen precision:

  • FP32: 4 bytes. Full precision.
  • FP16/BFLOAT16: 2 bytes. Reduced memory with minimal precision loss.
  • INT8: 1 byte. Maximum memory efficiency with more significant precision loss.

Final formula for memory calculation: Memory (GB) = Parameters * BytesPerParameter * 6 / (1024^3), where Parameters is the total parameter count (e.g., a 7B model means 7 × 10^9 parameters).
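The formula above can be sketched in a few lines of Python. This is an illustrative reimplementation of the calculation described here, not the calculator's actual source; the function and dictionary names are made up for this example.

```python
# Bytes per parameter for each supported precision (see the list above).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

# 1x for the model weights plus ~5x for optimizer states,
# gradients, activations, and temporary variables.
TRAINING_MULTIPLIER = 6

def training_memory_gb(num_params: float, precision: str = "fp32") -> float:
    """Estimate the GPU memory (in GB) needed to train a model."""
    storage_bytes = num_params * BYTES_PER_PARAM[precision]
    return storage_bytes * TRAINING_MULTIPLIER / 1024**3

# Example: a 7-billion-parameter model trained in FP16.
print(round(training_memory_gb(7e9, "fp16"), 1))  # ~78.2 GB
```

For comparison, the same 7B model needs only about 13 GB (7e9 * 2 bytes / 1024^3) just to store its FP16 weights for inference; the 6x multiplier is what makes training so much more demanding.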

Understanding Multi-GPU Compute Strategies

Distributed Data Parallel (DDP): This strategy replicates your model on each GPU, processing data in parallel and synchronizing the results to update the model identically across all GPUs. It's efficient for models that fit within a single GPU's memory, speeding up training by leveraging parallel computation.

Fully Sharded Data Parallel (FSDP): FSDP extends capabilities for larger models by sharding model states across GPUs, reducing memory redundancy. Inspired by the ZeRO optimization, it distributes model parameters, gradients, and optimizer states, enabling training of models too large for a single GPU.

Choosing between DDP and FSDP depends on your model's size and the memory capacity of your GPUs. FSDP minimizes the per-GPU memory footprint, allowing you to scale to larger models, while DDP is simpler and works well for models that fit comfortably in a single GPU's memory.
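A back-of-envelope comparison makes the trade-off concrete. The sketch below estimates per-GPU training memory under each strategy using the multiplier-of-6 heuristic from this page; it is plain arithmetic, not the PyTorch DDP/FSDP API, and the function name is made up for illustration. It assumes full (ZeRO-3 style) sharding for FSDP and, for simplicity, that activations are not sharded.

```python
def per_gpu_training_memory_gb(num_params: float, bytes_per_param: int,
                               num_gpus: int, strategy: str) -> float:
    """Rough per-GPU training memory (GB) under DDP vs fully sharded FSDP."""
    weights = num_params * bytes_per_param  # model weights
    states = weights * 3                    # gradients + two optimizer states (~3x weights)
    activations = weights * 2               # activations / temporaries (~2x weights)
    if strategy == "ddp":
        # DDP replicates everything on every GPU: the full 6x footprint.
        total = weights + states + activations
    elif strategy == "fsdp":
        # Fully sharded: weights, gradients, and optimizer states are
        # split across GPUs; activations stay local in this simple model.
        total = (weights + states) / num_gpus + activations
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return total / 1024**3

# Example: 7B parameters in FP16 (2 bytes) across 8 GPUs.
print(round(per_gpu_training_memory_gb(7e9, 2, 8, "ddp"), 1))   # ~78.2 GB per GPU
print(round(per_gpu_training_memory_gb(7e9, 2, 8, "fsdp"), 1))  # ~32.6 GB per GPU
```

Under these assumptions, DDP needs the full ~78 GB on every GPU (so the model must fit on one device), while FSDP spreads the sharded states across the group and brings the per-GPU requirement within reach of common accelerators.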