GPU / VRAM Calculator

Estimate the VRAM needed to serve a large language model (LLM). The estimate accounts for model weights, KV cache, activations, weight quantization, batch size, and the number of concurrent users.

Model & Architecture

Model Preset

Total Params (B)

Active Params (MoE)

Layers

Hidden Dim

Intermediate Dim (FFN)

Query Heads

KV Heads

Head Dim

Weight Quantization

KV Cache Precision

Context Length

Batch Size

Modality

Number of Concurrent Users

Avg. Tokens per Request

Safety Margin

Model Weights

KV Cache (1 request)

Activation Memory

Framework Overhead

Modality Overhead

Shared Cost

Per-User Cost

Total (1 user)

= shared cost + per-user cost
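The two largest terms, weights and KV cache, follow directly from the architecture inputs. Below is a minimal Python sketch under several assumptions. It uses nominal bytes per element for each precision and ignores the extra storage quantized formats need for scales. It treats shared cost as weights plus framework overhead, and per-user cost as KV cache plus activations. The activation and overhead figures are hypothetical constants, not values the calculator derives. `weight_bytes`, `kv_cache_bytes` and `single_user_total` are illustrative names.

```python
GIB = 1024**3

# Nominal storage per element. Real quantized formats (e.g. 4-bit with
# per-group scales) use slightly more; this sketch ignores that.
BYTES_PER_ELEMENT = {
    "fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
    "fp8": 1.0, "int8": 1.0, "int4": 0.5,
}


def weight_bytes(total_params_b: float, weight_dtype: str) -> float:
    # All parameters stay resident, so MoE models use total (not active) params.
    return total_params_b * 1e9 * BYTES_PER_ELEMENT[weight_dtype]


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, kv_dtype: str, batch: int = 1) -> float:
    # Each layer caches one K and one V vector of kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * context_len * batch * BYTES_PER_ELEMENT[kv_dtype]


def single_user_total(weights: float, kv_one_req: float,
                      activations: float, framework_overhead: float):
    # Assumed split: shared = weights + overhead, per-user = KV cache + activations.
    shared = weights + framework_overhead
    per_user = kv_one_req + activations
    return shared, per_user, shared + per_user


w = weight_bytes(8, "fp16")
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_len=8192, kv_dtype="fp16")
# Hypothetical activation (0.5 GiB) and framework-overhead (1 GiB) figures.
shared, per_user, total = single_user_total(w, kv, 0.5 * GIB, 1.0 * GIB)
print(f"weights {w / GIB:.2f} GiB, KV cache {kv / GIB:.2f} GiB, total {total / GIB:.2f} GiB")
```

Take a hypothetical 8B model with 32 layers, 8 KV heads and a head dim of 128. In FP16, its weights need about 14.90 GiB, and one 8,192-token request needs exactly 1.00 GiB of KV cache. Grouped-query attention keeps the KV term this small. With 32 KV heads instead of 8, the same request would need four times as much.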

Shared (weights + overhead)

Per-User × 1 user

Modality × Safety Margin

Total Server VRAM
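Server-wide memory then grows linearly with the number of concurrent users. The minimal sketch below makes three assumptions. Every concurrent user keeps its own KV cache and activations resident. The modality overhead is paid once. The safety margin multiplies the whole sum. The calculator itself may apply these terms differently. `server_vram` is a hypothetical name and works in any consistent unit.

```python
def server_vram(shared: float, per_user: float, users: int,
                modality_overhead: float = 0.0, safety_margin: float = 0.10) -> float:
    # Shared cost (weights + framework overhead) is paid once for the whole server.
    # Per-user cost (KV cache + activations) is paid once per concurrent user.
    raw = shared + per_user * users + modality_overhead
    # Assumption: the margin is fractional headroom applied to the total.
    return raw * (1.0 + safety_margin)


# Example in GiB: 16 GiB shared, 1.5 GiB per user, 8 users, 10% margin.
total_gib = server_vram(shared=16.0, per_user=1.5, users=8)
print(f"{total_gib:.1f} GiB")
```

The per-user term is multiplied by the user count. With long contexts or many users, the KV cache can therefore overtake the weights as the dominant cost.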

Active Batch (parallel)

Users Queued

Generation Speed (single stream)

Total Server Throughput

Per-User Speed

Time to First Token (est.)
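At small batch sizes, decoding speed is usually limited by memory bandwidth rather than compute. Each decoding step must stream the active weights, plus each request's KV cache, from VRAM. The sketch below adopts that memory-bound model as an assumption. It ignores compute limits, kernel efficiency and prefill, so its results are optimistic upper bounds, and it does not estimate time to first token. The function and parameter names are hypothetical.

```python
def decode_estimates(bandwidth_gb_s: float, weight_bytes_read: float,
                     kv_bytes_per_user: float, users: int, max_batch: int) -> dict:
    bw = bandwidth_gb_s * 1e9  # bytes per second
    # Requests beyond the batch limit wait in a queue.
    active = min(users, max_batch)
    queued = users - active
    # Single stream: one step reads the weights plus one request's KV cache.
    single_stream = bw / (weight_bytes_read + kv_bytes_per_user)
    # Batched step: weights are read once and shared; each active request adds its KV cache.
    step_seconds = (weight_bytes_read + active * kv_bytes_per_user) / bw
    return {
        "active_batch": active,
        "queued": queued,
        "single_stream_tps": single_stream,
        "total_tps": active / step_seconds,
        "per_user_tps": 1.0 / step_seconds,  # each active request gets one token per step
    }


# Hypothetical: 1000 GB/s GPU, 16 GB of weights, 1 GB KV per user, 12 users, batch 8.
est = decode_estimates(1000, 16e9, 1e9, users=12, max_batch=8)
print(est)
```

With these numbers, a single stream runs at about 59 tokens/s. Eight batched requests share the weight reads for about 333 tokens/s in total, or about 42 tokens/s each, and four users wait in the queue. For single-stream MoE decoding, `weight_bytes_read` would be based on active rather than total parameters. Batched MoE steps may touch more experts than that.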

Sized for 1 user.

GPU	VRAM	Bandwidth	# GPUs	Max Users	Gen Speed (tok/s)	Per-User Speed (tok/s)	Fit	Cost
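Each GPU row can be derived from the server totals. The sketch below is a hedged approximation with two assumptions. Memory pools perfectly across GPUs, with no per-GPU duplication under tensor or pipeline parallelism. The safety margin is fractional headroom on the total. `size_for_gpu` and its arguments are hypothetical.

```python
import math


def size_for_gpu(total_vram_gb: float, gpu_vram_gb: float, shared_gb: float,
                 per_user_gb: float, safety_margin: float = 0.10,
                 modality_gb: float = 0.0) -> tuple[int, int]:
    # Smallest number of GPUs whose combined VRAM covers the estimated total.
    num_gpus = max(1, math.ceil(total_vram_gb / gpu_vram_gb))
    # Usable memory across all GPUs after reserving the safety margin.
    budget = num_gpus * gpu_vram_gb / (1.0 + safety_margin)
    # Whatever remains after shared and modality costs is divided among users.
    max_users = max(0, math.floor((budget - shared_gb - modality_gb) / per_user_gb))
    return num_gpus, max_users


print(size_for_gpu(30.8, 24, shared_gb=16, per_user_gb=1.5))  # 24 GB card
print(size_for_gpu(30.8, 80, shared_gb=16, per_user_gb=1.5))  # 80 GB card
```

For a 30.8 GB target, this gives either 2 × 24 GB GPUs serving up to 18 users or a single 80 GB GPU serving up to 37. Because GPU counts round up to whole cards, Max Users can exceed the number of users the estimate was sized for.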