A Friday evening, at 5:21pm
On 12 June 2026, at 5:21pm New York time, Anthropic receives a letter from the U.S. Department of Commerce. It is an export control directive, issued under the Export Controls Reform Act of 2018. The content is blunt: suspend all access to Fable 5 and Mythos 5 — the company's two most powerful models, launched only a few days earlier — by any foreign national, inside or outside the United States, including Anthropic's own non-U.S. employees. Unable to separate foreign users from American ones in time, Anthropic does the only thing it can to stay compliant: it switches both models off for everyone, worldwide, within a few hours.
It is not a bug. It is not a commercial decision. It is a geopolitical decision taken elsewhere, over which customers have neither a voice nor any warning. It is also the first time the United States has used export controls not on chips, but directly on a model.
Let's stop and read that line carefully: the block targets foreign nationals. If you are a European company — like us, like many of our clients — you are the foreign national. You are exactly the category that was cut off overnight. If you had built a product, a workflow or a pipeline on that model, you would have found it switched off on Saturday morning, because of a letter that landed in an office on the other side of the ocean.
It is the clearest possible demonstration of what it means to build on someone else's model, behind someone else's border. And it makes a question that, until yesterday, felt a little academic suddenly very concrete: what does it mean to put an LLM “in-house”?
The answer is simple in principle: when you run an open model on your own hardware, the weights are on your disk, inference runs on your own machines, and no directive, from any government, can switch it off with a letter. It is not a question of the model's nationality (GPT-OSS is just as American as Fable) but of operational control: a model you hold locally cannot be revoked remotely by anyone, and your data never leaves your perimeter. For those working with clients regulated under frameworks such as DORA or NIS2, this stops being a nicety and becomes a requirement.
The good news is that today's open models are genuinely capable. In this article we will use them as concrete examples — OpenAI's open models GPT-OSS 20B and 120B (Apache 2.0 license) and Google's open family Gemma (not to be confused with Gemini, which is closed and cannot be put in-house) — to show exactly what runs on what. We'll think across the full spectrum: from the Macs many developers already have on their desks, all the way to enterprise 8-GPU nodes.
The two numbers that decide everything (plus a third)
Before talking about brands and models, you need to internalize two quantities. Everything else follows from them.
1. Memory capacity (VRAM): it decides whether the model fits. A model occupies memory equal to its parameters multiplied by the bytes for each one: 2 bytes in FP16/BF16, 1 byte in FP8, half a byte in INT4/FP4. A 70-billion-parameter model therefore wants ~140 GB in FP16, ~70 GB in FP8, ~35 GB in INT4. On top of this you have to add the KV cache, which grows with the context length. If the model plus its cache don't fit in memory, it simply won't start.
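To make the arithmetic concrete, here is a minimal Python sketch of the "does it fit" check. The weight formula is the one described above. The KV-cache formula is the usual single-sequence approximation, and the layer count, KV-head count and head dimension in the example are illustrative assumptions for a 70B-class model, not figures taken from a specific model card.

```python
# Bytes per parameter for the precisions mentioned in the text.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5, "fp4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV cache for ONE sequence.

    Keys and values (factor 2) are stored for every layer, KV head and token.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


def fits(params_billion: float, precision: str, cache_gb: float, memory_gb: float) -> bool:
    """True if weights plus cache stay within the available memory."""
    return weights_gb(params_billion, precision) + cache_gb <= memory_gb


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "int4"):
        print(precision, weights_gb(70, precision), "GB")  # 140.0, 70.0, 35.0

    # Hypothetical 70B-class shape (assumption): 80 layers, 8 KV heads, head_dim 128.
    cache = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=32_768)
    print(f"KV cache at 32k context: {cache:.1f} GB")
    print("FP8 on a 96 GB card:", fits(70, "fp8", cache, 96))
    print("FP16 on a 96 GB card:", fits(70, "fp16", cache, 96))
```

The check is deliberately conservative in one direction only: it ignores activation buffers and runtime overhead, so a model that "fits" here by a few hundred MB may still fail to load in practice.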
2. Memory bandwidth: it decides how fast it generates. Token generation, one after another, is limited by the speed at which the GPU reads the weights from memory (it is memory-bound). More bandwidth, more tokens per second. This is why a card with HBM at 3 TB/s “thrashes” an APU with unified memory at 250 GB/s even though both can hold the same model.
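The memory-bound argument can be turned into a rough ceiling: if every generated token requires reading the weights once, tokens per second cannot exceed bandwidth divided by the bytes read. The sketch below applies that to the two bandwidth figures just mentioned, using the ~35 GB INT4 footprint of a 70B model from the previous point. It is an upper bound under that single assumption; KV-cache reads, compute time and software overhead all push real numbers lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound model."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    model_gb = 35.0  # dense 70B model in INT4, as computed above

    hbm_card = max_tokens_per_second(3000.0, model_gb)  # HBM at 3 TB/s
    apu = max_tokens_per_second(250.0, model_gb)        # unified memory at 250 GB/s

    print(f"HBM card ceiling: {hbm_card:.1f} tokens/s")
    print(f"APU ceiling:      {apu:.1f} tokens/s")
    print(f"Ratio:            {hbm_card / apu:.0f}x")
```

Both devices hold the same model; the twelvefold gap in the ceiling comes entirely from bandwidth.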
Then there's a detail that changes the game with modern models: Mixture-of-Experts (MoE). GPT-OSS 120B has ~117 billion total parameters but activates only ~5 billion per token; GPT-OSS 20B has ~21 billion in total and ~3.6 billion active. Memory is taken up by all of them (every weight must stay resident); generation speed, on the other hand, depends mainly on the active ones. The result: these models are large to hold but fast to run. GPT-OSS 20B on a 24 GB card runs at ~136 tokens/s, the speed of a 7B model despite being three times its size. That is precisely what makes them interesting on “lightweight” hardware.
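The split between "what you must hold" and "what you must read per token" can be sketched as follows. The parameter counts are the ones quoted above; the flat 0.5 bytes per parameter is a simplifying assumption (real checkpoints keep some tensors at higher precision, which is why the published file is a little larger than this estimate).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float         # billions of parameters stored in memory
    active_params_b: float        # billions of parameters used per token
    bytes_per_param: float = 0.5  # assumption: uniform 4-bit weights

    @property
    def resident_gb(self) -> float:
        """Memory needed to hold every expert."""
        return self.total_params_b * self.bytes_per_param

    @property
    def read_per_token_gb(self) -> float:
        """Weights actually read to generate one token."""
        return self.active_params_b * self.bytes_per_param

    @property
    def speedup_vs_dense(self) -> float:
        """How much less data is read per token than in a dense model of equal size."""
        return self.total_params_b / self.active_params_b


if __name__ == "__main__":
    for model in (MoEModel("GPT-OSS 120B", 117, 5), MoEModel("GPT-OSS 20B", 21, 3.6)):
        print(f"{model.name}: hold ~{model.resident_gb:.1f} GB, "
              f"read ~{model.read_per_token_gb:.1f} GB per token, "
              f"~{model.speedup_vs_dense:.1f}x less traffic than dense")
```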
3. Concurrency: it decides whether you are serving one person or a team. It is the most underrated axis. Running a model for yourself is one thing; serving it to 10 developers at the same time is another. True concurrency requires a serving engine like vLLM (with continuous batching and PagedAttention), which runs on datacenter NVIDIA/AMD GPUs but not on Macs and only partially on APUs. This is the dividing line between a “developer machine” and a “server”.
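Why continuous batching matters for a team can be shown with a toy scheduler simulation. This is not vLLM's scheduler, only a sketch of the idea under strong assumptions: every request needs a known number of decode steps, each step produces one token for every running request, and prefill and memory limits (the part PagedAttention addresses) are ignored. The request lengths and batch size are made-up illustrative values.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes, then the next batch starts."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request frees its slot, and a waiting request takes it on the next step."""
    waiting = deque(lengths)
    running: list[int] = []  # remaining tokens for each running request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        steps += 1
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


if __name__ == "__main__":
    # Hypothetical workload: two long answers mixed with short ones.
    requests = [100, 10, 10, 10, 100, 10, 10, 10]
    print("static:    ", static_batching_steps(requests, batch_size=4), "steps")
    print("continuous:", continuous_batching_steps(requests, batch_size=4), "steps")
```

With static batches the short requests sit idle behind the long one; with continuous batching the freed slots are reused immediately, so the same eight requests complete in 110 steps instead of 200.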
Let's keep these three axes in mind — does it fit / how fast / for how many — and walk through the spectrum.
Level 1 — Apple Silicon: Mac mini, MacBook Pro, Mac Studio
The surprise, for anyone coming from the x86 world, is that a Mac is an excellent device for LLMs. The credit goes to the unified memory architecture: CPU and GPU share a single large pool of RAM. On a traditional PC the GPU is bound to the few GB of its dedicated VRAM; on a Mac you can devote tens of GB of system memory to the model. On top of that the software stack is mature and smooth — llama.cpp with the Metal backend, Apple's MLX framework, LM Studio, Ollama — often more painless than ROCm on AMD.
The discriminating factor between the various Macs is bandwidth, which rises with the chip tier:
- Mac mini (M4 / M4 Pro): up to 64 GB of unified memory, but limited bandwidth (~120 GB/s on the base M4, ~273 GB/s on the M4 Pro). It's the machine for tests and experiments, or for a developer with small models. Here GPT-OSS 20B (~14 GB, o3-mini-class quality) is the perfect example: it fits comfortably, it's fast, and it handles tool calling well. Gemma 3 12B and Qwen3 8B-14B also do well.
- MacBook Pro (M4/M5 Max): up to 128 GB, ~546 GB/s. Here you move up to quantized models in the 30B range: Gemma 3 27B, Gemma 4, lightweight 30B MoEs. The bonus is mobility: the model travels in your backpack.
- Mac Studio: the desktop. The M4 Max delivers 546 GB/s; the M3 Ultra reaches 819 GB/s and is Apple's fastest machine for single-user inference. On a 96 GB M3 Ultra you can even fit GPT-OSS 120B quantized (the MXFP4 checkpoint is ~61 GB). (A timely note: because of the global DRAM shortage, by mid-2026 the high-memory configurations of the Mac Studio have become hard to order, and the configurator tends to stop at the 96 GB of the M3 Ultra. Always check availability and delivery times.)
The limit, however, is clear-cut and must be stated: no vLLM on macOS. The Mac is a single-user machine, or at most one for very few users. It is perfect for testing, prototyping, or giving a developer a quality local model, but not for serving a team under concurrency.
Level 2 — AMD Ryzen AI Max+ 395 (Strix Halo): unified memory on x86
The PC world's answer to Apple's idea. The Strix Halo is an APU with up to 128 GB of unified LPDDR5X memory, of which ~96 GB can be allocated as VRAM, an integrated Radeon 8060S GPU and an NPU. You find it in mini-PCs such as the Framework Desktop, GMKtec EVO-X2, MINISFORUM, at prices around 2,000-3,000 euros (the price window has risen quite a bit with the RAM crisis) and with very low power draw, ~130 W.
The appeal is obvious: 128 GB of “VRAM” at that price doesn't exist anywhere else. But there's the other side of the coin, and it's precisely the bandwidth: ~256 GB/s theoretical, ~215 measured. That is about a quarter of an RTX 4090 and a fifteenth of an H100. And since generation is memory-bound, that bandwidth is the ceiling on speed.
Translated into real numbers, on current builds with llama.cpp/Vulkan:
- lightweight ~30B MoE models (e.g. Qwen3 30B-A3B): 70-100 tokens/s — very usable;
- GPT-OSS 120B: ~53 tokens/s — remarkable, for a 117B model on a two-thousand-euro mini-PC, and it's thanks to the mere ~5B active parameters;
- Qwen3-Coder-Next 80B-A3B in Q4: ~42 tokens/s — usable for a single developer;
- giant MoEs such as 235B: ~11 tokens/s — it runs, but slowly;
- dense 70B models: ~5 tokens/s — here the bandwidth sinks everything.
You can clearly see the MoE principle at work: models with few active parameters fly, dense ones crawl. The Strix Halo is therefore an excellent on-prem developer machine — cheap, low-power, with the data never leaving the desk — provided you accept that, exactly like the Mac, it is a single-user device: a programmer's machine, not a server.
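One way to sanity-check these figures is to run the memory-bound reasoning backwards: dividing the measured bandwidth by the measured tokens per second gives the most data the machine could have been reading per token. The sketch below does that with the numbers quoted above; the result is an implied ceiling, not a measurement of actual memory traffic.

```python
STRIX_HALO_MEASURED_GB_S = 215.0  # measured bandwidth quoted in the text

# Tokens per second quoted in the list above.
MEASURED_TOKENS_PER_S = {
    "GPT-OSS 120B": 53.0,
    "Qwen3-Coder-Next 80B-A3B (Q4)": 42.0,
    "235B MoE": 11.0,
    "dense 70B": 5.0,
}


def implied_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """Largest amount of data that can be read per token at this speed."""
    return bandwidth_gb_s / tokens_per_s


if __name__ == "__main__":
    for name, tps in MEASURED_TOKENS_PER_S.items():
        gb = implied_gb_per_token(STRIX_HALO_MEASURED_GB_S, tps)
        print(f"{name:32s} {tps:5.0f} tok/s  ->  at most ~{gb:.1f} GB read per token")
```

The dense 70B model comes out at roughly 43 GB per token, in line with reading a full 4-bit 70B model on every step, while GPT-OSS 120B comes out at about 4 GB despite a ~61 GB checkpoint. That is the MoE effect expressed in bytes.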
Level 3 — The enterprise leap: NVIDIA GPUs
Here you change category. You get two things that neither Macs nor APUs can give: bandwidth in the TB/s (HBM or GDDR7) and, above all, vLLM and true concurrency. An NVIDIA GPU in a Linux server genuinely serves an entire team. Let's look at the relevant options, with costs, power draw and — the point almost everyone forgets — the power supplies required.
NVIDIA RTX PRO 6000 Blackwell (96 GB): the sweet spot for most companies. 96 GB of GDDR7 at ~1.79 TB/s, native FP8 and FP4 support, on a single PCIe card. A 70-80B model in FP8 fits entirely with room for the KV cache, and GPT-OSS 120B runs cleanly on a single card (something impossible on a 24 GB consumer card). It exists in Workstation, Max-Q and Server Edition (passive, for racks) versions. Price around 8,500 dollars. The price to pay, literally, is the power draw: 600 W (the Max-Q drops to 300 W). A single one of these cards, added to the rest of the system, wants a robust 1,200-1,500 W power supply with headroom, and serious cooling. It has no NVLink: multiple cards scale as replicas, not as a memory pool.
RTX A6000 / RTX 6000 Ada (48 GB) — the previous generation, still perfectly valid. 48 GB per card, 300 W, cheaper and with mature drivers (zero early-adopter headaches). For a 70-80B model, or for GPT-OSS 120B, you need two of them (in tensor-parallel). A useful tidbit: the old A6000 (Ampere) has NVLink and the pair links up nicely; the 6000 Ada doesn't, but in exchange it has FP8. They are an excellent platform for a low-risk trial.
NVIDIA H100 (80 GB) — the datacenter card par excellence. 80 GB of HBM3, bandwidth up to ~3.35 TB/s (SXM version) or ~2 TB/s (PCIe), NVLink on the SXM. It is the natural “home” of GPT-OSS 120B, which fits cleanly at full speed. Power draw 350 W (PCIe) - 700 W (SXM), price in the order of 25,000-30,000 euros per card. SXM systems with 4-8 GPUs require multi-kW power and often liquid cooling.
NVIDIA H200 (141 GB) — the evolution: 141 GB of HBM3e at ~4.8 TB/s. More memory and more bandwidth than the H100, it is today one of the best cards for inference on large models. Cost and power draw higher still (~700 W, 30,000+ euros).
The electrical topic deserves an extra line, because it is a real physical constraint. A 600 W card is manageable; a server with 4 GPUs at 600 W means 2,400 W for the GPUs alone and 3-4 kW with the rest of the system, which reaches or exceeds what a normal domestic/office 16 A socket delivers (~3.5 kW). At these levels you need redundant 2,000+ W power supplies, dedicated circuits and a real cooling plan. The hardware is only half the problem: the other half is where you put it.
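The power budget is simple enough to check before buying anything. In the sketch below the 230 V mains voltage is an assumption (typical for Europe), and the two "rest of system" values are simply the figures that turn 2,400 W of GPUs into the 3-4 kW range mentioned above; measure your own chassis before relying on them.

```python
def system_power_w(gpus: int, gpu_w: float, rest_of_system_w: float) -> float:
    """Peak draw of the whole server: GPUs plus CPU, RAM, disks, fans."""
    return gpus * gpu_w + rest_of_system_w


def circuit_capacity_w(amps: float = 16.0, volts: float = 230.0) -> float:
    """Nominal capacity of a single circuit (assumption: 230 V mains)."""
    return amps * volts


def headroom_w(load_w: float, capacity_w: float) -> float:
    """Positive means spare capacity; negative means the circuit is overloaded."""
    return capacity_w - load_w


if __name__ == "__main__":
    capacity = circuit_capacity_w()
    print(f"16 A circuit: {capacity:.0f} W")

    # Low and high ends of the 3-4 kW range for a 4x 600 W server.
    for rest in (600.0, 1600.0):
        load = system_power_w(gpus=4, gpu_w=600.0, rest_of_system_w=rest)
        print(f"load {load:.0f} W -> headroom {headroom_w(load, capacity):+.0f} W")
```

Even at the low end the margin is a few hundred watts on a circuit that would have to be dedicated to one machine, which is why this class of server belongs on purpose-built circuits.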
Level 4 — Clusters: the interconnect wall
A natural question: if a card is expensive and has little memory, why not network together lots of cheap machines — several APUs, or PCs with an RTX 5090 of 32 GB each — and add up their memory to run an enormous model?
Technically you can. In practice you pay a very harsh price, and there is just one reason: the network between machines is orders of magnitude slower than memory. A GPU's memory runs at TB/s; the link between two PCs, even over a good USB4/Thunderbolt or 10GbE connection, delivers around 10 Gbps in practice, which is about 1.25 GB/s: a gap of roughly a thousand times. When you split a model across multiple networked nodes, every token has to bounce data between machines through that bottleneck. Tools like llama.cpp RPC, on large models, fall into “round robin” mode: they pass the processing from one node to the next in sequence instead of parallelizing, and the speed collapses. Software for AI clustering, today, is not yet mature for production.
The practical rule that follows is important: for inference, scale “vertically” before “horizontally”. More GPUs in the same computer, linked to each other via PCIe Gen5 (~64 GB/s) or, better, via NVLink (hundreds of GB/s), scale well: tensor-parallel works and performance is solid. The same GPUs scattered across different machines and linked via Ethernet give severe degradation. An RTX 5090 (32 GB, 1.79 TB/s, 575 W, ~2,000 euros list — often much more due to the GDDR7 shortage) is a very fast and cheap card: but the right way to use several of them is to put 2-4 in a single server, not to build a networked PC farm. The multi-node cluster only makes sense as a last resort to hold a model that fits nowhere else, accepting that it will be slow.
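The size of the interconnect wall is easier to appreciate with the units lined up. The sketch below converts the network figure from gigabits to gigabytes per second and compares it with the RTX 5090 memory bandwidth and the PCIe Gen5 figure quoted above. NVLink is left out because the text only gives "hundreds of GB/s" rather than a specific number.

```python
def gbps_to_gb_s(gigabits_per_second: float) -> float:
    """Network links are quoted in bits; memory in bytes. 8 bits per byte."""
    return gigabits_per_second / 8.0


# Bandwidths in GB/s, taken from the figures in the text.
LINKS_GB_S = {
    "GPU memory (RTX 5090)": 1790.0,
    "PCIe Gen5": 64.0,
    "10 Gbps network": gbps_to_gb_s(10.0),
}


def slowdown_vs(reference: str, link: str) -> float:
    """How many times slower `link` is than `reference`."""
    return LINKS_GB_S[reference] / LINKS_GB_S[link]


if __name__ == "__main__":
    for name, gb_s in LINKS_GB_S.items():
        print(f"{name:24s} {gb_s:8.2f} GB/s")
    print(f"network vs GPU memory: {slowdown_vs('GPU memory (RTX 5090)', '10 Gbps network'):.0f}x slower")
    print(f"network vs PCIe Gen5:  {slowdown_vs('PCIe Gen5', '10 Gbps network'):.0f}x slower")
```

PCIe inside one chassis is already far below memory speed, and the network sits roughly another fifty times below that. This is the quantitative reason for scaling vertically first.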
Level 5 — The summit: AMD Instinct MI300X and the like
At the very top are AMD's datacenter accelerators. The Instinct MI300X brings 192 GB of HBM3 per GPU (the MI325X reaches 256 GB, the MI355X 288 GB), with bandwidth above 5 TB/s and, by now, excellent software support via ROCm 7 and vLLM. They are capacity monsters: a single MI300X holds in memory a model that would require multiple H100s.
There is, however, a structural constraint that keeps them out of reach for most: they are bought only as 8-GPU platforms (OAM form factor on a Universal Base Board). A node is therefore a system with 1.5 TB+ of HBM, dual EPYC, mandatory liquid cooling on the most aggressive configurations, 6 kW and beyond of power draw, and a price that starts at 150,000 euros and comfortably goes past 300,000. Every integrator (Dell, HPE, Supermicro, Lenovo, GIGABYTE) builds them at 8x, and the costs are, indeed, stratospheric. It only makes sense for those who want to host frontier models in-house (the DeepSeeks or GLMs of hundreds of billions of parameters) or serve thousands of users. For everyone else, it is oversized by orders of magnitude.
Tying it together: which level for which need
There is no “right” hardware in absolute terms: there is the one that's right for your workload. To summarize:
| Level | Typical hardware | Memory | Bandwidth | Realistic models | Concurrency | Indicative cost | Who it's for |
|---|---|---|---|---|---|---|---|
| Light test / dev | Mac mini, Ryzen AI Max+ 395 | 32-128 GB | 120-256 GB/s | GPT-OSS 20B, Gemma 3, lightweight MoEs | Single-user | 0.6-3 k€ | Tests, prototypes, a dev with small models |
| Quality dev | MacBook Pro / Mac Studio M3 Ultra | 96-128 GB | 546-819 GB/s | Gemma 3 27B, GPT-OSS 120B quantized | Single-user | 3-5 k€ | A developer with mid-size models, mobility |
| SME server (sweet spot) | 1-2× RTX PRO 6000 Blackwell | 96-192 GB | ~1.8 TB/s | GPT-OSS 120B, 70-80B in FP8, high concurrency | Team (vLLM) | 8-18 k€ | Serving a team on-prem |
| Serious serving / training | H100 / H200 | 80-141 GB | 3.3-4.8 TB/s | large models, training | Large team | 25-40 k€/GPU | Intensive workloads, SLAs |
| Frontier / scale | 8× AMD MI300X (or similar) | 1.5 TB+/node | 5+ TB/s | frontier models, multi-tenant | Hundreds+ | 150-300 k€+ | Hosting enormous models in-house |
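As a rough first filter, the table can be encoded and queried. The sketch below uses the upper bound of each memory range in the table and a single yes/no flag for team serving. Both are simplifications: usable memory is lower than installed memory on unified-memory machines, and whether two cards behave as one pool depends on the setup. The 10 GB KV-cache allowance in the example is a hypothetical value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_memory_gb: float  # upper bound of the table's memory range
    serves_team: bool     # True where the table lists team-level concurrency


TIERS = [
    Tier("Light test / dev", 128, False),
    Tier("Quality dev", 128, False),
    Tier("SME server (sweet spot)", 192, True),
    Tier("Serious serving / training", 141, True),
    Tier("Frontier / scale", 1536, True),
]


def candidate_tiers(model_gb: float, kv_cache_gb: float, needs_team: bool) -> list[str]:
    """Tiers where weights plus cache fit and the concurrency need is met."""
    needed = model_gb + kv_cache_gb
    return [
        tier.name
        for tier in TIERS
        if tier.max_memory_gb >= needed and (tier.serves_team or not needs_team)
    ]


if __name__ == "__main__":
    # GPT-OSS 120B (~61 GB checkpoint) plus a hypothetical 10 GB of KV cache.
    print("single developer:", candidate_tiers(61, 10, needs_team=False))
    print("whole team:      ", candidate_tiers(61, 10, needs_team=True))
```

The filter only answers "does it fit" and "for how many"; the "how fast" axis still has to be judged from the bandwidth column.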
The thread running through it all remains that of the three initial numbers. VRAM decides whether the model starts. Bandwidth decides how fast it generates. And vLLM on real GPUs decides whether you are serving one person or a team.
But above everything there is the lesson of 12 June. A model behind an API is not really yours: it can be switched off by a decision over which you have no control, in an afternoon, without warning. An open model on your own hardware cannot. For many companies — especially those working with regulated clients — the most sensible solution is neither the desk toy nor the 300,000-euro node, but that middle step: one or two 96 GB enterprise GPUs in an on-prem server, running a good open model for the whole team, without a single line of code leaving the company perimeter, and without anyone, anywhere, being able to pull the plug.
Luca Vitali