tgroenwals shared this post · 2h ago
plutos

Everyone Is Buying the $1,499 "Nvidia Killer." Here's the Number Nobody Screenshotted.

You saw the thread. AMD CEO Lisa Su held a lunchbox-sized PC on stage, the demo ran a 235-billion-parameter model, and the box “beat an RTX 5080 by 3x.” So you did the $5,280-a-year subscription math and almost bought the $1,499 box.
The $1,499 box can’t run that model. And the 3x benchmark isn’t a speed win.
Here’s the box that actually runs 235B, the one number the thread left out, the Linux config that unlocks the memory, the exact way to point Claude Code at it, and the four workloads where this thing beats your cloud bill — plus the one where it falls flat. Every price, bandwidth, and token-rate figure below has been re-checked against primary sources; the verification notes are at the end.
https://pbs.twimg.com/media/HKzwtZnXsAAWU5J.png

The Benchmark That Broke The Room Is A Capacity Test

The viral line is “the Ryzen AI Max+ 395 beat an RTX 5080 by more than 3x on DeepSeek R1 inference.” It is a real number from a real test. It also tells you almost nothing about speed.
https://pbs.twimg.com/media/HKzwxQaWoAEDn7H.png
Call the failure mode the capacity-vs-speed swap — a benchmark wins on whether a model fits, then gets reported as if it won on how fast it runs. Those are different races, and AMD’s own materials say as much: the headline figure is “up to 3.05x” and it only appears once the model exceeds the 16GB the RTX 5080 has on board.
The mechanism: DeepSeek R1 at full size does not fit in the RTX 5080’s 16GB of VRAM. When a model overflows a discrete GPU’s memory, the extra layers spill into system RAM and crawl across the PCIe bus, which is dramatically slower than the card’s own memory. The 5080 didn’t lose because it’s slow — it lost because it was swapping weights through a straw. The AMD box “won” because all 128GB sits in one pool and nothing had to spill.
Flip to a model that does fit in 16GB and the result inverts. The 5080 pulls far ahead, because the number that governs token generation is memory bandwidth — and the gap there is not close.
https://pbs.twimg.com/media/HKzw34nXsAAMtIc.png
*256 GB/s is the theoretical peak of Strix Halo’s 256-bit LPDDR5X-8000 bus; real-world measured bandwidth lands around 210–220 GB/s. The point stands: this box moves data at roughly a fifth of a 5090’s rate and about a quarter of a 4090’s. During generation the GPU streams the model’s weights out of memory on every step. Less bandwidth, fewer tokens per second. There is no software trick around it.
https://pbs.twimg.com/media/HKzxClVWMAAPqKT.png
The mistake to avoid: quoting the “3x faster than a 5080” line to justify the purchase. It’s a capacity result wearing a speed costume. The honest version is narrower and more useful — this box runs models a 5080 physically cannot load, and runs them slowly.

You’re About To Buy The Wrong Box

The thread says $1,499. The photo shows a box running a 235-billion-parameter model. Those are not the same box.
The failure here has a name: parameter-count goggles. You fixate on “235B!” and the lowest visible price, and ignore the two numbers that actually decide your experience — how much memory the specific SKU has, and how fast that memory is.
It’s just product tiering. The $1,499 GMKtec EVO-X2 is the 64GB / 1TB configuration. A 235-billion-parameter model — even a Mixture-of-Experts one — does not fit in 64GB. Neither does a dense 70B at a sane quantization. The box that ran the demo is the 128GB version, and that one is not $1,499.
https://pbs.twimg.com/media/HKzxJ74XYAAwXWg.png
Same chip, brand-name tax
The 128GB EVO-X2 lands between roughly $2,199 and $2,299 depending on the seller — Tom’s Hardware logged the GMKtec-direct 128GB/2TB unit at $2,199, Micro Center lists it at $2,299, and GMKtec promo pricing has occasionally dipped to $1,999. The earlier “$1,999” figure circulating online was a promo price, not the standing one; budget ~$2,200.
And the “$4,000 Nvidia box” the thread mocks is NVIDIA’s DGX Spark. AMD now sells its ownfirst-party answer to it — the Ryzen AI Halo developer PC, sold through Micro Center, which opened at $3,999 with the identical Strix Halo silicon and 128GB. Same chip as the $2,200 third-party boxes, roughly $1,800 more, a logo and a developer-program bundle. So if your goal is the thing in the photo — a single box that loads a 200B-class model — your real number is about $2,200, not $1,499, and there is no reason to pay AMD’s $3,999.
https://pbs.twimg.com/media/HKzxOZgWMAAiMbE.png
The step that saves you a return: decide your largest target model first, then buy the SKU that holds it. Want 70B? 96GB minimum. Want 235B MoE? 128GB only. Buying the $1,499 box to run 235B is a guaranteed return.

The Memory Trick Nobody Explains (And The Part That Changed)

“On Linux you get ~110GB of usable VRAM” is true. It also does not happen by itself, it never happens to that degree on Windows, and the exact steps have shifted since the early guides — so here’s the current version.
https://pbs.twimg.com/media/HKzxTJCWIAE2Cj2.png
The trap: you boot the machine, open LM Studio, try to load a 70B, and it errors out claiming you have 16GB — or even 512MB — of graphics memory. You paid for 128GB. Where did it go?
On a unified-memory APU there is no separate VRAM chip. The system carves a slice of the shared pool for the GPU. Two things govern how big the GPU’s working pool can get: a BIOS carve-out (AMD calls the convertible portion Variable Graphics Memory, capped at 96GB officially) and, on Linux, the kernel’s GTTlimit (Graphics Translation Table — how the GPU addresses ordinary system memory). Windows tops out at AMD’s 96GB VGM allocation. Linux, via the GTT path, lets the GPU reach ~110GB and beyond. That headroom is the whole reason serious users run these boxes on Ubuntu/Fedora, not Windows.
What changed: the early guides told you to set VGM/UMA to its maximum. For Linux LLM use the current consensus is the opposite — set the BIOS carve-out to its minimum (512MB) and let the kernel’s GTT limit hand the GPU the rest of the pool dynamically. GTT-backed allocations aren’t permanently reserved, so the OS reclaims what the GPU isn’t using, and benchmarks show no speed penalty versus a large fixed carve-out. Also note the amdgpu.gttsize parameter is being deprecated in favour of the TTM page limit, and you want a recent kernel.

# 1. BIOS: set the GPU carve-out LOW, not high.
# Advanced -> AMD CBS -> NBIO Common Options
# Set "UMA Frame Buffer" / VGM to its MINIMUM (512MB) for the
# Linux+GTT path. (On Windows, set VGM high instead, max 96GB.)
# While here: set IOMMU to Disabled (small bandwidth win).
# 2. Use a recent kernel. 6.16.9+ fixes the bug where ROCm only
# sees ~15.5GB despite a correct allocation.
uname -r # want >= 6.16.9
# 3. Hand the GPU the pool via the kernel command line.
sudo nano /etc/default/grub
# Add to GRUB_CMDLINE_LINUX_DEFAULT. ttm.pages_limit is in PAGES
# (4KB each): 28160000 pages ~= 110GB. Prefer ttm.* over gttsize,
# which is being deprecated; setting both is harmless on older kernels.
# amd_iommu=off ttm.pages_limit=28160000 amdgpu.gttsize=110000
# 4. Rebuild grub and reboot.
sudo update-grub && sudo reboot
# 5. Verify GTT (NOT vram) on a unified-memory APU.
cat /sys/class/drm/card*/device/mem_info_gtt_total # bytes; want >100GB
cat /sys/module/ttm/parameters/pages_limit # pages

On a unified APU, rocm-smiwill report tiny VRAM (~1GB) by design — the real compute pool lives in GTT, so check mem_info_gtt_total. Get this wrong and the box behaves like a $1,499 disappointment. Get it right and the 70B you couldn’t load anywhere else snaps into memory.
The metric to watch: after reboot, GTT total should read north of 100GB. If it still shows a few GB, your BIOS carve-out or kernel line didn’t take — fix that before you blame the model.

Pick The Model The Bandwidth Likes

With ~110GB you can load almost anything. Bandwidth decides whether you’ll enjoyit.
https://pbs.twimg.com/media/HKzxhPOXYAEQZLU.png
The detail the thread skips: the 235B model everyone screenshotted is Qwen3-235B-A22B, a Mixture-of-Experts model. It has 235 billion total parameters but activates only about 22 billion on any given token (8 of 128 experts). The chip only streams the active experts through that ~256 GB/s pipe, not the whole 235B. That’s the only reason a box this slow can touch a model this big.
So the rule: at ~256 GB/s, MoE models punch far above their weight, and dense models hit a wall proportional to their size.
https://pbs.twimg.com/media/HKzxkRwWMAAsxyV.png
Runs at all, which is the point
The ~11 tok/s for Qwen3-235B isn’t a guess — it’s GMKtec’s own published figure for the 128GB EVO-X2. That’s roughly comfortable reading speed: fine for a conversation, slow for agentic coding, where a tool like Claude Code fires many model calls per task and each waits on the last. The 30B-A3B numbers, by contrast, have improvedsince the early reviews — current llama.cpp Vulkan/ROCm builds push it to 70–100 tok/s, well past the ~50 quoted a year ago. Dense 70B is the opposite story: at Q4 it’s ~40GB of weights streamed every token, so it’s firmly memory-bound in the single digits.
And none of this counts prefill — processing a long prompt before the first token appears. Prompt processing is genuinely slow on this hardware; feeding a 100K-token context takes real seconds-to-minutes, not the instant you’re used to from the cloud.
The mistake to avoid: grabbing a stock instruct model off the shelf and wiring it into a coding agent. It’ll look fine in a chat box, then silently fail the moment the agent tries to edit a file, because it was never trained for tool calls. For agent work pull a tool-capable model — Qwen3-Coder, GLM-4.x — and nothing less.

Point Claude Code At It — The Right Way

https://pbs.twimg.com/media/HKzxwo3XsAANPQR.png
“Point Claude Code at localhost” used to require a proxy. As of 2026 it’s native — but only because of a specific change, and only if you do a few things right.
The history matters. Claude Code speaks Anthropic’s Messages API (/v1/messages). Ollama and LM Studio historically spoke OpenAI’s chat-completions format, so people ran translation proxies like LiteLLM. That changed on January 16, 2026, when Ollama v0.14 shipped a native Anthropic-compatible endpoint. LM Studio followed exactly two weeks later, on January 30, with the same endpoint in version 0.4.1. The proxy is now optional for most setups.

# 1. Install Ollama (need v0.14.0+ for the Anthropic endpoint).
curl -fsSL https://ollama.com/install.sh | sh
ollama --version # confirm >= 0.14.0
# 2. Pull a TOOL-CAPABLE model. Not a stock instruct model.
ollama pull qwen3-coder # strong agentic / tool-call behaviour
# (alternative: a glm-4.x build with tool calling + long context)
# 3. Raise context. Claude Code's system prompt is large; 8K/16K will
# fail before your code arrives. 32K is the floor, 64K the sweet spot.
export OLLAMA_CONTEXT_LENGTH=32768
# 4. Point Claude Code at the local endpoint. Two env vars.
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama # required but ignored
# 5. Run it.
claude --model qwen3-coder

That’s the whole connection — nothing leaves the machine, nothing meters per request. Two known sharp edges so you don’t lose an afternoon. First, Ollama’s compatibility layer currently ignores tool_choice, so Claude Code can occasionally pick the wrong tool and loop — a model with clean native tool calling mostly behaves. Second, don’t route through LiteLLM when your backend is already native Anthropic; it adds a network hop for nothing. The proxy is only for the case where your only option is an OpenAI-format local endpoint.
Because every token costs latency at this speed, give the local model a profile that keeps it terse. Drop a CLAUDE.mdin your project root:

# CLAUDE.md — local model profile
## DEFAULTS
- You run on a local model at ~11 tokens/sec. Tokens are slow. Be terse.
- No preamble, no recap, no "here's my plan." Output the change.
## BEHAVIOR
- Edit the smallest number of lines that solves the task.
Never rewrite a whole file to change one function.
- Touch only files I name. Ask before creating new ones.
- After an edit, print a one-line summary of what changed. Nothing more.
## STACK
- [your locked stack here] — never propose alternatives.
## OUTPUT
- Cap responses at ~300 tokens unless I ask for more.
Long outputs stall at this token rate.

The metric to watch: time-to-first-token on a real task. Under ~5 seconds and the box feels like a coding partner. Creeping past 20–30 seconds on long contexts means your prompts are too big for the bandwidth — trim context before you blame the hardware.

The Math, Honestly

Here’s the honest version of the money story, because the viral one quietly cheats.
The pitch: $200 Claude Code Max, $200 ChatGPT Pro, $20 Cursor, $20 Gemini — $440 a month, $5,280 a year, “and the box pays itself off in 9 months.” The subscription numbers are real. The payback number doesn’t survive arithmetic.
If you fully replaced $440/month with a ~$2,200 box, payback would be about five months, not nine. So why does the thread say nine? Because nine months only works if you don’t fully replace your stack — and you won’t. That’s the part they bury.
https://pbs.twimg.com/media/HKzx4ccW8AApvA-.png
And the honest limitation, stated plainly: a local open-weight model on this box is not Claude Opus or GPT-5. Qwen3-235B and DeepSeek are strong, but not frontier-closed-model strong on hard reasoning, and you’ll feel the gap on your most demanding work. So the realistic move isn’t “cancel everything.” It’s “move the bulk, privacy-sensitive, high-volume, no-deadline work to the box, and keep one frontier subscription for the 10% that needs it.” That’s a ~$200/month offload, a ~11-month payback, and a box that then runs free.
https://pbs.twimg.com/media/HKzx7TwWwAAZz-V.png
This box is wrong for some people, full stop. If your models fit in 24–32GB, an RTX 4090 or 5090 runs them several times faster for similar money — buy the GPU. If you need frontier reasoning every day, no local setup matches it yet — keep paying. The box wins on exactly one axis: running models too big for a consumer GPU, privately, with no per-token meter, at a speed you can tolerate.
This was never a Nvidia killer. It’s a memory-capacity tool with a bandwidth ceiling, and once you see it that way the pitch stops being hype and starts being a purchase decision you can get right. So before you spend $2,200, spend ten minutes: install Ollama on whatever machine you own, pull Qwen3-Coder, set those two env vars, point Claude Code at localhost, and watch your real tokens-per-second crawl by — because if 11 of them a second feels fine to you, then buy the 128GB box, not the $1,499 one.

177