VRAM Calculator User Guide

This guide explains how to use the VRAM Calculator tool available on the Cordatus Platform.

VRAM Calculator is a powerful tool that helps you determine which LLM (Large Language Model) can run on which GPU with specific configurations. This tool calculates GPU VRAM requirements by considering factors such as model weights, KV Cache, activation memory, and system overhead, allowing you to plan your deployments in advance.

See details → Application Hub Overview | Application Hub Quickstart | Standard Application Launch Guide | NVIDIA AI Dynamo Guide | User Models Guide


2. What is VRAM Calculator?

VRAM Calculator is a tool that calculates the total GPU memory required to run an LLM model and visualizes how this memory is distributed.

[Video: VRAM Calculator Complete Guide]

The calculations include the following components:

  • Model Weights: The memory space occupied by the model's parameters
  • KV Cache: Key-Value cache used in Transformer models
  • Overhead/Activation Memory: System overhead or activation memory, depending on the selected Calculation Type (see section 6.6)
  • Free VRAM: Remaining available memory space

3. Accessing VRAM Calculator

You can access VRAM Calculator in two different ways:

  1. Standalone Mode:
    You can use it independently by selecting the VRAM Calculator option from the main menu.

  2. Device Mode:
    After connecting to a specific device, you can perform calculations using that device's GPUs.


4. Model Selection

4.1 Registered Models

VRAM Calculator lists LLM models that are predefined and tested in the Cordatus system.

  • Model Selection: Select the desired model from the dropdown menu
  • Once a model is selected, model parameters registered in the system (hidden_size, num_params, attention_heads, etc.) are automatically loaded

4.2 Adding New Models (Search Huggingface Model)

If your desired model is not registered in the system, you can use the Search Huggingface Model feature.

Supported Formats:

  1. Hugging Face URL:
    https://huggingface.co/meta-llama/Llama-2-7b-hf

  2. Model Name:
    meta-llama/Llama-2-7b-hf

Usage Steps:

  1. Paste the model name or URL into the Search Huggingface Model field
  2. Click the Find button
  3. If the model is found successfully:
    • The model is automatically added to the registered models list
    • Model parameters are fetched from the Hugging Face API
    • A success message is displayed (e.g., "Model 'meta-llama/Llama-2-7b-hf' found with 7B parameters")
  4. If the model is not found, an error message is displayed

💡 Note: With this feature, you can both contribute to the Cordatus system and add new models for use in VRAM Calculator.
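
Under the hood, such a lookup amounts to reading the model's config.json from the Hugging Face Hub. The sketch below is illustrative rather than the Cordatus implementation; the `hf_...` token is a placeholder (meta-llama repositories are gated and require an access token):

```python
import requests

def fetch_model_config(model_id: str, token: str = "") -> dict:
    """Fetch a model's config.json from the Hugging Face Hub."""
    url = f"https://huggingface.co/{model_id}/resolve/main/config.json"
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raises for 404 (not found) or 401 (gated repo)
    return response.json()

# hf_... is a placeholder token; meta-llama repositories are gated
config = fetch_model_config("meta-llama/Llama-2-7b-hf", token="hf_...")
print(config["hidden_size"], config["num_hidden_layers"], config["num_attention_heads"])
# 4096 32 32
```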


5. GPU Selection

5.1 Standalone Mode

Select the GPU model from the dropdown menu. GPU models and VRAM capacities registered in the Cordatus system are listed:

  • NVIDIA A100 (80GB)
  • NVIDIA H100 (80GB)
  • NVIDIA V100 (32GB)
  • NVIDIA RTX 4090 (24GB)
  • and other GPU models

5.2 Device Mode

In device mode, the actual GPUs of the connected device are automatically detected and listed:

  • GPU name and VRAM capacity are displayed
  • If there are multiple GPUs, all are selected by default
  • You can customize the selection by removing GPUs you don't want to include
  • The total VRAM capacity of selected GPUs is automatically calculated

⚙️ Important: In Device Mode, GPU count is automatically determined and the GPU Count setting is disabled.


6. Configuration Settings

After selecting a model and GPU, basic configurations are automatically defined and calculation begins. You can customize the following parameters:

6.1 GPU Count

⚠️ Note: This setting is only available in Standalone Mode. In Device Mode, the selected GPU count is automatically determined.

  • Description: The number of GPUs to be used to run the model
  • Default Value: 1
  • Usage: You can increase the total VRAM capacity by using multiple GPUs
  • Effect: Total System VRAM = GPU VRAM × GPU Count × GPU Memory Utilization

Example:

  • 1 × NVIDIA A100 (80GB) = 80GB total VRAM
  • 4 × NVIDIA A100 (80GB) = 320GB total VRAM

6.2 Quantization

  • Description: The format in which model weights are stored in memory

  • Default Value: BF16 (16-bit Brain Floating Point)

  • Options:

    • BF16 (16-bit): Highest accuracy, highest memory usage
    • FP16 (16-bit): High accuracy, high memory usage
    • FP8 (8-bit): Medium accuracy, medium memory usage
    • INT8 (8-bit): Good accuracy, lower memory usage
    • FP4 (4-bit): Low accuracy, minimum memory usage
    • INT4 (4-bit): Lowest accuracy, minimum memory usage
  • Effect: As the quantization bit width decreases:

    • Model weights occupy less space
    • You can run larger models on smaller GPUs
    • Model accuracy may decrease

Calculation:

Model Weight Size = (Num Parameters × Quantization Bits) / 8 / (1024³)

Example:

  • 7B parameter model
  • FP16 (16-bit): 7B × 16 / 8 ≈ 14GB
  • INT8 (8-bit): 7B × 8 / 8 ≈ 7GB
  • INT4 (4-bit): 7B × 4 / 8 ≈ 3.5GB
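
As a sanity check, the formula is easy to reproduce in Python. This minimal sketch follows the documented formula; because it divides by 1024³, the results come out slightly below the rounded decimal figures above:

```python
def model_weight_gb(num_params: float, quant_bits: int) -> float:
    """Model Weight Size = (Num Parameters × Quantization Bits) / 8 / (1024³)"""
    return num_params * quant_bits / 8 / 1024**3

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {model_weight_gb(7e9, bits):.2f} GB")
# 16-bit: 13.04 GB
#  8-bit: 6.52 GB
#  4-bit: 3.26 GB
```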

6.3 Sequence Length

  • Description: The maximum number of tokens the model can process at once
  • Default Value: 1024
  • Range: 1024 - 262144 (slider with 9 different values: 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K)
  • Manual Entry: You can enter a number directly instead of using the slider
  • Effect:
    • Directly affects KV Cache size
    • Affects Activation Memory size
    • Longer sequence requires more VRAM

KV Cache Calculation:

KV Cache = (Batch Size × Sequence Length × Num Key-Value Heads × Head Dimension × 2 × Num Layers × Quantization Bytes) / GB

Activation Memory Calculation:

Activation = (Sequence Length × Batch Size × (18 × Hidden Size + 4 × Intermediate Size)) / GB
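
Both formulas can be reproduced in a short Python sketch. The architecture values below are assumptions taken from Llama-2-7B's config.json (32 layers, 32 KV heads, head dimension 128, hidden size 4096, intermediate size 11008), with an FP16 KV cache (2 bytes per value):

```python
GB = 1024**3

def kv_cache_gb(batch, seq_len, num_kv_heads, head_dim, num_layers, quant_bytes=2):
    """KV Cache; the factor of 2 accounts for storing both keys and values."""
    return batch * seq_len * num_kv_heads * head_dim * 2 * num_layers * quant_bytes / GB

def activation_gb(batch, seq_len, hidden_size, intermediate_size):
    """Activation Memory, per the formula above."""
    return seq_len * batch * (18 * hidden_size + 4 * intermediate_size) / GB

# Assumed Llama-2-7B values: 32 layers, 32 KV heads, head_dim 128,
# hidden_size 4096, intermediate_size 11008
for seq in (1024, 4096, 32768):
    kv = kv_cache_gb(1, seq, 32, 128, 32)
    act = activation_gb(1, seq, 4096, 11008)
    print(f"seq={seq:>6}: KV Cache {kv:.2f} GB, Activation {act:.2f} GB")
# seq=  1024: KV Cache 0.50 GB, Activation 0.11 GB
# seq=  4096: KV Cache 2.00 GB, Activation 0.45 GB
# seq= 32768: KV Cache 16.00 GB, Activation 3.59 GB
```

Both values also scale linearly with Batch Size, so the same functions apply to section 6.4.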

Example Use Cases:

  • 1024-2048: Short conversations, quick responses
  • 4096-8192: Medium-length texts, code generation
  • 16384-32768: Long documents, detailed analysis
  • 65536+: Very long texts, book analysis

6.4 Batch Size

  • Description: The number of requests to be processed in parallel simultaneously
  • Default Value: 1
  • Range: 1 - 512 (or more)
  • Effect:
    • Directly affects KV Cache size
    • Affects Activation Memory size
    • Higher batch size means more throughput but more VRAM

Example:

  • Batch Size = 1: Single user, low VRAM
  • Batch Size = 8: 8 parallel requests, medium VRAM
  • Batch Size = 32: High throughput, high VRAM
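
Because KV Cache scales linearly with Batch Size, the tradeoff is easy to quantify. A quick sketch using the assumed Llama-2-7B values from section 6.3 (FP16, sequence length 4096):

```python
GB = 1024**3
# Assumed Llama-2-7B values (see 6.3): seq 4096, FP16 KV cache
for batch in (1, 8, 32):
    kv = batch * 4096 * 32 * 128 * 2 * 32 * 2 / GB
    print(f"batch={batch:>2}: KV Cache {kv:.1f} GB")
# batch= 1: KV Cache 2.0 GB
# batch= 8: KV Cache 16.0 GB
# batch=32: KV Cache 64.0 GB
```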

6.5 GPU Memory Utilization

  • Description: What percentage of the GPU's total VRAM will be used for the model
  • Default Value: 0.90 (90%)
  • Range: 0.0 - 1.0 (0% - 100%)
  • Slider Step: 0.01 (1%)
  • Why Not 100%?
    • The operating system and GPU drivers reserve some memory
    • Space is needed for CUDA kernels and other system operations
    • 90-95% is typically the optimal range

Effect:

Available System VRAM = GPU VRAM × GPU Count × GPU Memory Utilization

Example:

  • NVIDIA A100: 80GB × 0.90 = 72GB available VRAM
  • NVIDIA A100: 80GB × 1.0 = 80GB available VRAM (risky)
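
A one-line sketch of the effect above:

```python
def available_vram_gb(gpu_vram_gb: float, gpu_count: int, utilization: float) -> float:
    """Available System VRAM = GPU VRAM × GPU Count × GPU Memory Utilization"""
    return gpu_vram_gb * gpu_count * utilization

print(available_vram_gb(80, 1, 0.90))  # 72.0  (1 × A100)
print(available_vram_gb(80, 4, 0.90))  # 288.0 (4 × A100)
```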

6.6 Calculation Type

Two different overhead calculation methods are offered:

Overhead (Simple Approach)

  • Adds overhead equal to 20% of model weights
  • Calculation: Overhead = Model Weights × 0.20
  • Usage: For quick and simple estimation
  • Advantage: Easy to understand
  • Disadvantage: Less precise

Activation (Detailed Approach)

  • Fixed 1.5GB + Activation Memory
  • Calculation: Overhead = 1.5 GB + Activation Memory
  • Activation Memory depends on model architecture and sequence/batch parameters
  • Usage: For more accurate calculation
  • Advantage: More accurate results
  • Disadvantage: More complex

💡 Recommendation: Use Activation mode for production environments.
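
Both modes can be sketched in a few lines. This is illustrative only; the weight and activation figures plugged in are the Llama-2-7B FP16 values from the earlier examples:

```python
def overhead_gb(calculation_type: str, model_weights_gb: float,
                activation_memory_gb: float = 0.0) -> float:
    """System overhead under the two documented calculation types."""
    if calculation_type == "overhead":
        return model_weights_gb * 0.20       # simple: 20% of model weights
    if calculation_type == "activation":
        return 1.5 + activation_memory_gb    # detailed: fixed 1.5 GB + activations
    raise ValueError(f"unknown calculation type: {calculation_type!r}")

# Llama-2-7B FP16 figures from the earlier examples
print(f"{overhead_gb('overhead', 13.04):.2f} GB")         # 2.61 GB
print(f"{overhead_gb('activation', 13.04, 0.45):.2f} GB") # 1.95 GB
```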


7. Results Screen

7.1 Summary Statistics (Stats Grid)

A three-column grid of six summary statistics is displayed at the top:

  1. Model Weights

    • 🔵 Blue icon
    • Space occupied by model parameters
    • Displayed in GB
  2. KV Cache

    • 🟡 Yellow icon
    • Key-Value cache size
    • Varies according to sequence length and batch size
  3. Overhead / Activation

    • 🟠 Amber icon
    • System overhead or activation memory
    • Varies according to selected calculation type
  4. Required VRAM

    • 🔴 Red icon
    • Total required VRAM
    • Sum of Weights + KV Cache + Overhead
    • Highlighted in red if insufficient
  5. Available VRAM

    • 🟢 Green icon
    • Total VRAM available in the system
    • GPU VRAM × GPU Count × Utilization
    • Highlighted in green if sufficient
  6. Free VRAM

    • 🔵 Blue-green icon
    • Remaining free space after model is loaded
    • Available - Required

7.2 Chart (Doughnut Chart)

A circular chart is displayed on the left:

Chart Components:

  • 🔵 Blue Slice: Model Weights
  • 🟠 Orange Slice: Overhead
  • 🟡 Yellow Slice: KV Cache
  • 🟢 Green Slice: Free VRAM

Chart States:

  • No Data: A gray "No Data" placeholder is displayed
  • Sufficient VRAM: Green slice is visible
  • Insufficient VRAM: Green slice is not visible, only used areas are shown

7.3 Detailed Metrics

A detailed metric list is displayed on the right:

  1. Model Weights: Size of model weights in GB
  2. Overhead: System overhead (according to calculation type)
  3. KV Cache: Key-Value cache size
  4. Activation Memory: Activation memory (in Activation mode)
  5. Free VRAM: Remaining free space (highlighted in green)

Usage Bar:

  • Ratio of required VRAM to available VRAM
  • Displayed as percentage
  • Below 100%: Purple color (✅ Sufficient)
  • Above 100%: Red color (❌ Insufficient)

7.4 Status Summary

A large status card is displayed at the bottom:

Sufficient VRAM

  • ✅ Green icon and frame
  • Title: "VRAM is Sufficient"
  • Description: Total required VRAM and available VRAM information
  • Color: Green tones

Example Message:

✅ VRAM is Sufficient
Your configuration requires 45.2 GB of VRAM,
and you have 72.0 GB available.

Insufficient VRAM

  • ❌ Red icon and frame
  • Title: "VRAM is Insufficient"
  • Description: Total required VRAM and available VRAM information
  • Suggestions: Increase the GPU count or switch to a lower-bit quantization (e.g., INT8 or INT4)

Example Message:

❌ VRAM is Insufficient
Your configuration requires 95.8 GB of VRAM,
but you only have 72.0 GB available.
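
Putting the pieces together, the sufficiency check that produces these messages can be approximated end to end. The sketch below assumes Llama-2-7B (FP16, sequence length 4096, batch size 1, architecture values as in section 6.3) on a single NVIDIA A100 80GB at 0.90 utilization, using the Activation calculation type:

```python
GB = 1024**3

# Assumed scenario: Llama-2-7B, FP16, sequence length 4096, batch size 1,
# 1 × NVIDIA A100 (80GB), GPU Memory Utilization 0.90, Activation mode.
weights    = 7e9 * 16 / 8 / GB                         # Model Weights  ≈ 13.04 GB
kv_cache   = 1 * 4096 * 32 * 128 * 2 * 32 * 2 / GB     # KV Cache       =  2.00 GB
activation = 4096 * 1 * (18 * 4096 + 4 * 11008) / GB   # Activation     ≈  0.45 GB
overhead   = 1.5 + activation                          # Overhead       ≈  1.95 GB

required  = weights + kv_cache + overhead              # Required VRAM  ≈ 16.99 GB
available = 80 * 1 * 0.90                              # Available VRAM = 72.00 GB
usage_pct = required / available * 100                 # Usage Bar      ≈ 23.6%

if required <= available:
    print(f"✅ VRAM is Sufficient\n"
          f"Your configuration requires {required:.1f} GB of VRAM,\n"
          f"and you have {available:.1f} GB available.")
else:
    print(f"❌ VRAM is Insufficient\n"
          f"Your configuration requires {required:.1f} GB of VRAM,\n"
          f"but you only have {available:.1f} GB available.")
```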