VRAM Calculator User Guide

This guide explains how to use the VRAM Calculator tool available on the Cordatus Platform.

VRAM Calculator is a powerful tool that helps you determine which LLM (Large Language Model) can run on which GPU with specific configurations. This tool calculates GPU VRAM requirements by considering factors such as model weights, KV Cache, activation memory, and system overhead, allowing you to plan your deployments in advance.

See details → Application Hub Overview | Application Hub Quickstart | Standard Application Launch Guide | NVIDIA AI Dynamo Guide | User Models Guide


2. What is VRAM Calculator?

VRAM Calculator is a tool that calculates the total GPU memory required to run an LLM model and visualizes how this memory is distributed.

[Video: VRAM Calculator Complete Guide]

The calculations include the following components:

  • Model Weights: The memory space occupied by the model's parameters
  • KV Cache: Key-Value cache used in Transformer models
  • Overhead/Activation Memory: System overhead or activation memory, depending on the selected Calculation Type (see section 6.6)
  • Free VRAM: Remaining available memory space

3. Accessing VRAM Calculator

You can access VRAM Calculator in two different ways:

  1. Standalone Mode:
    You can use it independently by selecting the VRAM Calculator option from the main menu.

  2. Device Mode:
    After connecting to a specific device, you can perform calculations using that device's GPUs.


4. Model Selection

4.1 Registered Models

VRAM Calculator lists LLM models that are predefined and tested in the Cordatus system.

  • Model Selection: Select the desired model from the dropdown menu
  • Once a model is selected, model parameters registered in the system (hidden_size, num_params, attention_heads, etc.) are automatically loaded

4.2 Adding New Models (Search Huggingface Model)

If your desired model is not registered in the system, you can use the Search Huggingface Model feature.

Supported Formats:

  1. Hugging Face URL:
    https://huggingface.co/meta-llama/Llama-2-7b-hf

  2. Model Name:
    meta-llama/Llama-2-7b-hf

Usage Steps:

  1. Paste the model name or URL into the Search Huggingface Model field
  2. Click the Find button
  3. If the model is found successfully:
    • The model is automatically added to the registered models list
    • Model parameters are fetched from the Hugging Face API
    • A success message is displayed (e.g., "Model 'meta-llama/Llama-2-7b-hf' found with 7B parameters")
  4. If the model is not found, an error message is displayed

💡 Note: With this feature, you can both contribute to the Cordatus system and add new models for use in VRAM Calculator.
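
Under the hood, such a lookup amounts to reading the model's config.json from the Hugging Face Hub. The sketch below is illustrative rather than the Cordatus implementation; the `hf_...` token is a placeholder (meta-llama repositories are gated and require an access token):

```python
import requests

def fetch_model_config(model_id: str, token: str = "") -> dict:
    """Fetch a model's config.json from the Hugging Face Hub."""
    url = f"https://huggingface.co/{model_id}/resolve/main/config.json"
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raises for 404 (not found) or 401 (gated repo)
    return response.json()

# hf_... is a placeholder token; meta-llama repositories are gated
config = fetch_model_config("meta-llama/Llama-2-7b-hf", token="hf_...")
print(config["hidden_size"], config["num_hidden_layers"], config["num_attention_heads"])
# 4096 32 32
```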


5. GPU Selection

5.1 Standalone Mode

Select the GPU model from the dropdown menu. GPU models and VRAM capacities registered in the Cordatus system are listed:

  • NVIDIA A100 (80GB)
  • NVIDIA H100 (80GB)
  • NVIDIA V100 (32GB)
  • NVIDIA RTX 4090 (24GB)
  • and other GPU models

5.2 Device Mode

In device mode, the actual GPUs of the connected device are automatically detected and listed:

  • GPU name and VRAM capacity are displayed
  • If there are multiple GPUs, all are selected by default
  • You can customize the selection by removing GPUs you don't want to include
  • The total VRAM capacity of selected GPUs is automatically calculated

⚙️ Important: In Device Mode, GPU count is automatically determined and the GPU Count setting is disabled.


6. Configuration Settings

After selecting a model and GPU, basic configurations are automatically defined and calculation begins. You can customize the following parameters:

6.1 GPU Count

⚠️ Note: This setting is only available in Standalone Mode. In Device Mode, the selected GPU count is automatically determined.

  • Description: The number of GPUs to be used to run the model
  • Default Value: 1
  • Usage: You can increase the total VRAM capacity by using multiple GPUs
  • Effect: Total System VRAM = GPU VRAM × GPU Count × GPU Memory Utilization

Example:

  • 1 × NVIDIA A100 (80GB) = 80GB total VRAM
  • 4 × NVIDIA A100 (80GB) = 320GB total VRAM

6.2 Quantization

  • Description: The format in which model weights are stored in memory

  • Default Value: BF16 (16-bit Brain Floating Point)

  • Options:

    • BF16 (16-bit): Highest accuracy, highest memory usage
    • FP16 (16-bit): High accuracy, high memory usage
    • FP8 (8-bit): Medium accuracy, medium memory usage
    • INT8 (8-bit): Good accuracy, lower memory usage
    • FP4 (4-bit): Low accuracy, minimum memory usage
    • INT4 (4-bit): Lowest accuracy, minimum memory usage
  • Effect: As the quantization bit width decreases:

    • Model weights occupy less space
    • You can run larger models on smaller GPUs
    • Model accuracy may decrease

Calculation:

Model Weight Size = (Num Parameters × Quantization Bits) / 8 / (1024³)

Example:

  • 7B parameter model
  • FP16 (16-bit): 7B × 16 / 8 ≈ 14GB
  • INT8 (8-bit): 7B × 8 / 8 ≈ 7GB
  • INT4 (4-bit): 7B × 4 / 8 ≈ 3.5GB
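
As a sanity check, the formula is easy to reproduce in Python. This minimal sketch follows the documented formula; because it divides by 1024³, the results come out slightly below the rounded decimal figures above:

```python
def model_weight_gb(num_params: float, quant_bits: int) -> float:
    """Model Weight Size = (Num Parameters × Quantization Bits) / 8 / (1024³)"""
    return num_params * quant_bits / 8 / 1024**3

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {model_weight_gb(7e9, bits):.2f} GB")
# 16-bit: 13.04 GB
#  8-bit: 6.52 GB
#  4-bit: 3.26 GB
```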

6.3 Sequence Length

  • Description: The maximum number of tokens the model can process at once
  • Default Value: 1024
  • Range: 1024 - 262144 (slider with 9 different values: 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K)
  • Manual Entry: You can enter a number directly instead of using the slider
  • Effect:
    • Directly affects KV Cache size
    • Affects Activation Memory size
    • Longer sequence requires more VRAM

KV Cache Calculation:

KV Cache = (Batch Size × Sequence Length × Num Key-Value Heads × Head Dimension × 2 × Num Layers × Quantization Bytes) / GB

Activation Memory Calculation:

Activation = (Sequence Length × Batch Size × (18 × Hidden Size + 4 × Intermediate Size)) / GB
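
Both formulas can be reproduced in a short Python sketch. The architecture values below are assumptions taken from Llama-2-7B's config.json (32 layers, 32 KV heads, head dimension 128, hidden size 4096, intermediate size 11008), with an FP16 KV cache (2 bytes per value):

```python
GB = 1024**3

def kv_cache_gb(batch, seq_len, num_kv_heads, head_dim, num_layers, quant_bytes=2):
    """KV Cache; the factor of 2 accounts for storing both keys and values."""
    return batch * seq_len * num_kv_heads * head_dim * 2 * num_layers * quant_bytes / GB

def activation_gb(batch, seq_len, hidden_size, intermediate_size):
    """Activation Memory, per the formula above."""
    return seq_len * batch * (18 * hidden_size + 4 * intermediate_size) / GB

# Assumed Llama-2-7B values: 32 layers, 32 KV heads, head_dim 128,
# hidden_size 4096, intermediate_size 11008
for seq in (1024, 4096, 32768):
    kv = kv_cache_gb(1, seq, 32, 128, 32)
    act = activation_gb(1, seq, 4096, 11008)
    print(f"seq={seq:>6}: KV Cache {kv:.2f} GB, Activation {act:.2f} GB")
# seq=  1024: KV Cache 0.50 GB, Activation 0.11 GB
# seq=  4096: KV Cache 2.00 GB, Activation 0.45 GB
# seq= 32768: KV Cache 16.00 GB, Activation 3.59 GB
```

Both values also scale linearly with Batch Size, so the same functions apply to section 6.4.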

Example Use Cases:

  • 1024-2048: Short conversations, quick responses
  • 4096-8192: Medium-length texts, code generation
  • 16384-32768: Long documents, detailed analysis
  • 65536+: Very long texts, book analysis

6.4 Batch Size

  • Description: The number of requests to be processed in parallel simultaneously
  • Default Value: 1
  • Range: 1 - 512 (or more)
  • Effect:
    • Directly affects KV Cache size
    • Affects Activation Memory size
    • Higher batch size means more throughput but more VRAM

Example:

  • Batch Size = 1: Single user, low VRAM
  • Batch Size = 8: 8 parallel requests, medium VRAM
  • Batch Size = 32: High throughput, high VRAM
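
Because KV Cache scales linearly with Batch Size, the tradeoff is easy to quantify. A quick sketch using the assumed Llama-2-7B values from section 6.3 (FP16, sequence length 4096):

```python
GB = 1024**3
# Assumed Llama-2-7B values (see 6.3): seq 4096, FP16 KV cache
for batch in (1, 8, 32):
    kv = batch * 4096 * 32 * 128 * 2 * 32 * 2 / GB
    print(f"batch={batch:>2}: KV Cache {kv:.1f} GB")
# batch= 1: KV Cache 2.0 GB
# batch= 8: KV Cache 16.0 GB
# batch=32: KV Cache 64.0 GB
```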

6.5 GPU Memory Utilization

  • Description: What percentage of the GPU's total VRAM will be used for the model
  • Default Value: 0.90 (90%)
  • Range: 0.0 - 1.0 (0% - 100%)
  • Slider Step: 0.01 (1%)
  • Why Not 100%?
    • The operating system and GPU drivers reserve some memory
    • Space is needed for CUDA kernels and other system operations
    • 90-95% is typically the optimal range

Effect:

Available System VRAM = GPU VRAM × GPU Count × GPU Memory Utilization

Example:

  • NVIDIA A100: 80GB × 0.90 = 72GB available VRAM
  • NVIDIA A100: 80GB × 1.0 = 80GB available VRAM (risky)
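
A one-line sketch of the effect above:

```python
def available_vram_gb(gpu_vram_gb: float, gpu_count: int, utilization: float) -> float:
    """Available System VRAM = GPU VRAM × GPU Count × GPU Memory Utilization"""
    return gpu_vram_gb * gpu_count * utilization

print(available_vram_gb(80, 1, 0.90))  # 72.0  (1 × A100)
print(available_vram_gb(80, 4, 0.90))  # 288.0 (4 × A100)
```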

6.6 Calculation Type

Two different overhead calculation methods are offered:

Overhead (Simple Approach)

  • Adds overhead equal to 20% of model weights
  • Calculation: Overhead = Model Weights × 0.20
  • Usage: For quick and simple estimation
  • Advantage: Easy to understand
  • Disadvantage: Less precise

Activation (Detailed Approach)

  • Fixed 1.5GB + Activation Memory
  • Calculation: Overhead = 1.5 GB + Activation Memory
  • Activation Memory depends on model architecture and sequence/batch parameters
  • Usage: For more accurate calculation
  • Advantage: More accurate results
  • Disadvantage: More complex

💡 Recommendation: Use Activation mode for production environments.
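
Both modes can be sketched in a few lines. This is illustrative only; the weight and activation figures plugged in are the Llama-2-7B FP16 values from the earlier examples:

```python
def overhead_gb(calculation_type: str, model_weights_gb: float,
                activation_memory_gb: float = 0.0) -> float:
    """System overhead under the two documented calculation types."""
    if calculation_type == "overhead":
        return model_weights_gb * 0.20       # simple: 20% of model weights
    if calculation_type == "activation":
        return 1.5 + activation_memory_gb    # detailed: fixed 1.5 GB + activations
    raise ValueError(f"unknown calculation type: {calculation_type!r}")

# Llama-2-7B FP16 figures from the earlier examples
print(f"{overhead_gb('overhead', 13.04):.2f} GB")         # 2.61 GB
print(f"{overhead_gb('activation', 13.04, 0.45):.2f} GB") # 1.95 GB
```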


7. Results Screen

7.1 Summary Statistics (Stats Grid)

A three-column grid of six summary statistics is displayed at the top:

  1. Model Weights

    • 🔵 Blue icon
    • Space occupied by model parameters
    • Displayed in GB
  2. KV Cache

    • 🟡 Yellow icon
    • Key-Value cache size
    • Varies according to sequence length and batch size
  3. Overhead / Activation

    • 🟠 Amber icon
    • System overhead or activation memory
    • Varies according to selected calculation type
  4. Required VRAM

    • 🔴 Red icon
    • Total required VRAM
    • Sum of Weights + KV Cache + Overhead
    • Highlighted in red if insufficient
  5. Available VRAM

    • 🟢 Green icon
    • Total VRAM available in the system
    • GPU VRAM × GPU Count × Utilization
    • Highlighted in green if sufficient
  6. Free VRAM

    • 🔵 Blue-green icon
    • Remaining free space after model is loaded
    • Available - Required

7.2 Chart (Doughnut Chart)

A circular chart is displayed on the left:

Chart Components:

  • 🔵 Blue Slice: Model Weights
  • 🟠 Orange Slice: Overhead
  • 🟡 Yellow Slice: KV Cache
  • 🟢 Green Slice: Free VRAM

Chart States:

  • No Data: A gray "No Data" placeholder is displayed
  • Sufficient VRAM: Green slice is visible
  • Insufficient VRAM: Green slice is not visible, only used areas are shown

7.3 Detailed Metrics

A detailed metric list is displayed on the right:

  1. Model Weights: Size of model weights in GB
  2. Overhead: System overhead (according to calculation type)
  3. KV Cache: Key-Value cache size
  4. Activation Memory: Activation memory (in Activation mode)
  5. Free VRAM: Remaining free space (highlighted in green)

Usage Bar:

  • Ratio of required VRAM to available VRAM
  • Displayed as percentage
  • Below 100%: Purple color (✅ Sufficient)
  • Above 100%: Red color (❌ Insufficient)

7.4 Status Summary

A large status card is displayed at the bottom:

Sufficient VRAM

  • ✅ Green icon and frame
  • Title: "VRAM is Sufficient"
  • Description: Total required VRAM and available VRAM information
  • Color: Green tones

Example Message:

✅ VRAM is Sufficient
Your configuration requires 45.2 GB of VRAM,
and you have 72.0 GB available.

Insufficient VRAM

  • ❌ Red icon and frame
  • Title: "VRAM is Insufficient"
  • Description: Total required VRAM and available VRAM information
  • Suggestions: Increase the GPU count or switch to a lower-bit quantization (e.g., INT8 or INT4)

Example Message:

❌ VRAM is Insufficient
Your configuration requires 95.8 GB of VRAM,
but you only have 72.0 GB available.
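
Putting the pieces together, the sufficiency check that produces these messages can be approximated end to end. The sketch below assumes Llama-2-7B (FP16, sequence length 4096, batch size 1, architecture values as in section 6.3) on a single NVIDIA A100 80GB at 0.90 utilization, using the Activation calculation type:

```python
GB = 1024**3

# Assumed scenario: Llama-2-7B, FP16, sequence length 4096, batch size 1,
# 1 × NVIDIA A100 (80GB), GPU Memory Utilization 0.90, Activation mode.
weights    = 7e9 * 16 / 8 / GB                         # Model Weights  ≈ 13.04 GB
kv_cache   = 1 * 4096 * 32 * 128 * 2 * 32 * 2 / GB     # KV Cache       =  2.00 GB
activation = 4096 * 1 * (18 * 4096 + 4 * 11008) / GB   # Activation     ≈  0.45 GB
overhead   = 1.5 + activation                          # Overhead       ≈  1.95 GB

required  = weights + kv_cache + overhead              # Required VRAM  ≈ 16.99 GB
available = 80 * 1 * 0.90                              # Available VRAM = 72.00 GB
usage_pct = required / available * 100                 # Usage Bar      ≈ 23.6%

if required <= available:
    print(f"✅ VRAM is Sufficient\n"
          f"Your configuration requires {required:.1f} GB of VRAM,\n"
          f"and you have {available:.1f} GB available.")
else:
    print(f"❌ VRAM is Insufficient\n"
          f"Your configuration requires {required:.1f} GB of VRAM,\n"
          f"but you only have {available:.1f} GB available.")
```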