VRAM Calculator for AI Models

Calculate how much VRAM you need to run a neural network

How to Calculate VRAM?

The amount of video memory needed to run a language model depends on the number of parameters and computation precision:

FP16 (16-bit) — 2 bytes per parameter. Half precision, the usual baseline for inference; maximum quality among these options.

INT8 (8-bit) — 1 byte per parameter. Minimal quality loss, half the memory.

INT4 (4-bit) — 0.5 bytes per parameter. Noticeable quality loss, but the model fits on budget GPUs.

Formula: VRAM (GB) = parameters × bytes_per_parameter × 1.15 / 10⁹, where 1.15 is an overhead factor accounting for the KV cache, activations, and the inference framework.
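The formula translates directly into code. A minimal sketch (the function name and the 7B example model are illustrative, not part of any library):

```python
def vram_gb(params: float, bytes_per_param: float, overhead: float = 1.15) -> float:
    """Estimate the VRAM (in GB) needed to run a model.

    params          -- number of parameters (e.g. 7e9 for a 7B model)
    bytes_per_param -- 2 for FP16, 1 for INT8, 0.5 for INT4
    overhead        -- multiplier for KV cache, activations, framework
    """
    return params * bytes_per_param * overhead / 1e9

# A 7B model at different precisions:
print(f"FP16: {vram_gb(7e9, 2):.1f} GB")    # ≈ 16.1 GB
print(f"INT8: {vram_gb(7e9, 1):.1f} GB")    # ≈ 8.0 GB
print(f"INT4: {vram_gb(7e9, 0.5):.1f} GB")  # ≈ 4.0 GB
```

Note how each halving of precision halves the requirement: the same 7B model goes from a 24 GB card (FP16) to fitting comfortably on an 8 GB budget GPU (INT4).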

How Much VRAM Do You Need?

The formula above gives the minimum VRAM needed just to load the model weights. In practice, video memory is consumed by more than the LLM itself, and this is important to account for when planning.

KV Cache and Batching. When processing long prompts or multiple requests simultaneously, the KV cache can consume 1 to 10+ GB on top of model weights. The longer the context and the more concurrent requests, the more additional memory is needed.
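The KV cache contribution can be estimated from the model architecture: the cache stores one key and one value vector per layer, per KV head, per token, per sequence. A sketch using illustrative parameters for a 7B-class model (32 layers, 32 KV heads, head dimension 128 — these numbers are assumptions, check your model's config):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1,
                bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GB.

    Factor of 2: the cache holds both a key and a value vector
    for every (layer, KV head, token, sequence) combination.
    bytes_per_value is 2 for an FP16 cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_value) / 1e9

# Illustrative 7B-class model, FP16 cache, 4096-token context:
print(f"{kv_cache_gb(32, 32, 128, context_len=4096):.1f} GB")  # ≈ 2.1 GB
# Same model serving 4 concurrent requests:
print(f"{kv_cache_gb(32, 32, 128, context_len=4096, batch_size=4):.1f} GB")
```

The estimate scales linearly with both context length and batch size, which is why a server handling several long-context requests at once can easily add many gigabytes on top of the weights.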

Remote Desktop (RDP/VNC). If you connect to a server via RDP, the video driver reserves some VRAM for desktop rendering — typically 200-500 MB, but at high resolutions with multiple monitors it can reach 1-2 GB. If the LLM is already using nearly all memory, the RDP connection may get a black screen and the model may throw an OOM error.

Other GPU Processes. System desktop, browser, video player, ComfyUI, Jupyter — all of these consume VRAM. On a dedicated server without GUI this is not an issue, but on a workstation you should reserve an extra 1-2 GB.

Recommendation: choose a GPU with at least 15-20% VRAM headroom above the calculated minimum. For production inference on a dedicated server without GUI, 10% is sufficient.