Groq and the LPU: the record on the inference-speed story

You type a question and the answer doesn't arrive line by line. It dumps. A whole paragraph appears at once, faster than you can read it, like the model already knew the answer and just had to spit it out. That's the moment a Groq demo lands. People who'd watched ChatGPT crawl out a token at a time saw a Groq stream in early 2024 and reacted the way you'd react to a magic trick — the gut "wait, how is that even possible" before the brain catches up.

This is the record on what's actually behind that trick, what's verified, what's still vendor copy, and where the company sits now — because the business story took a hard turn in late 2025.

What the LPU actually is

Groq was founded in 2016 by ex-Google engineers led by Jonathan Ross, who had helped design Google's Tensor Processing Unit, and Douglas Wightman from Google X (per Wikipedia). The chip was first called the Tensor Streaming Processor (TSP), then rebranded the Language Processing Unit (LPU) "to make the nature of the processor more obvious" (Wikipedia).

The LPU is built for inference — running an already-trained model — not training, which is the heavier job GPUs dominate. The design pitch, in Groq's own words: a "single-core deterministic architecture" where "the Groq Compiler is fully deterministic and schedules every memory load, operation, and packet transmission exactly when needed" (Groq, LPU architecture). It strips out the reactive hardware a GPU leans on — branch predictors, caches — and hands all timing control to the compiler (Wikipedia).
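To make "hands all timing control to the compiler" concrete, here is a toy sketch of static scheduling in Python. It is not Groq's compiler or anything derived from it: the `Op` type, the `compile_schedule` function, the operation names, and the cycle counts are all invented for illustration. The only idea it borrows from the text is that every operation's start time is fixed before the program runs, which is possible when latencies are known in advance and nothing depends on a cache hit or a branch guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A hypothetical hardware operation with a latency known at compile time."""
    name: str
    latency: int          # cycles; assumed fixed and known in advance
    deps: tuple = ()      # names of ops that must finish before this one starts


def compile_schedule(ops):
    """Assign every op a fixed start cycle before anything runs.

    Toy assumptions: ops are listed in dependency order, and there are
    no resource conflicts between functional units.
    """
    finish = {}
    schedule = {}
    for op in ops:
        # An op starts the cycle its last dependency finishes (or at cycle 0).
        start = max((finish[d] for d in op.deps), default=0)
        schedule[op.name] = start
        finish[op.name] = start + op.latency
    return schedule


# Made-up program and latencies, purely for illustration.
program = [
    Op("load_weights", latency=4),
    Op("load_activations", latency=2),
    Op("matmul", latency=8, deps=("load_weights", "load_activations")),
    Op("add_bias", latency=1, deps=("matmul",)),
    Op("send_result", latency=3, deps=("add_bias",)),
]

schedule = compile_schedule(program)
for name, start in schedule.items():
    print(f"cycle {start:>2}: {name}")
```

Running the same program through `compile_schedule` again gives the same start cycles every time. That is the sense of "deterministic" at stake here: the timetable is a compile-time artifact, not something the hardware works out on the fly.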

The deterministic claim, plainly

Here's the part that's a genuine engineering argument, not marketing. A GPU spends a lot of its life waiting — for memory to load, for a cache to fill, for a packet that collided to resend. Groq says its LPU "never has to wait for a cache that has yet to be filled, resend a packet because of a collision, or pause for memory to load" (Groq blog).

Two design choices drive that. First, memory: the LPU uses on-chip SRAM as primary weight storage, which Groq claims is "100x faster than the HBM memory used by GPUs" (Groq). Second, scheduling: because every operation's timing is fixed at compile time, Groq says it runs "nearly 100% of its compute capacity for the actual workload," versus GPUs that it claims "often run at 30–40% utilization during inference because they are waiting on memory" (Groq).

Verified vs hype: the deterministic, compiler-scheduled, SRAM-first architecture is real and well-documented, and it is a legitimately different bet from Nvidia's. The "100x" and "nearly 100%" utilization figures are Groq's own numbers, not independent measurements, so treat them as the vendor's framing.
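Taken at face value, Groq's utilization figures imply a gap you can work out in a few lines. The sketch below uses only the vendor's quoted numbers and adds one loud assumption of its own: that the two chips have the same peak compute, which nothing in the source establishes. The function name `implied_advantage` is made up for this example.

```python
# Figures quoted from Groq's own material; not independent measurements.
LPU_UTILIZATION = 1.00                 # "nearly 100%", rounded up as an upper bound
GPU_UTILIZATION_RANGE = (0.30, 0.40)   # Groq's claim about GPUs during inference


def implied_advantage(lpu_util, gpu_util):
    """Ratio of useful work per unit of peak compute.

    Assumption (not from the source): both chips have equal peak compute,
    so the ratio of utilizations is the whole difference.
    """
    return lpu_util / gpu_util


low = implied_advantage(LPU_UTILIZATION, GPU_UTILIZATION_RANGE[1])
high = implied_advantage(LPU_UTILIZATION, GPU_UTILIZATION_RANGE[0])
print(f"Implied advantage from utilization alone: {low:.1f}x to {high:.1f}x")
```

Under that equal-peak assumption, utilization alone would account for a difference of roughly 2.5x to 3.3x. That is arithmetic on vendor claims, not a measurement.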

The real tokens-per-second

Speed is where Groq earned the buzz, and here independent data exists. Groq's first public showing was strong enough that the benchmark firm ArtificialAnalysis had to "double the axis" on its chart to fit the LPU's results (Groq).

On reported throughput: Llama-3 70B runs at roughly 800 tokens/second sustained on Groq, and the instruction-tuned variant of Google's Gemma 7B has hit around 2,800 tokens/second (figures surfaced in third-party 2026 benchmark coverage). For contrast, those same write-ups peg typical GPU-based providers at 50–150 tokens/second. A widely cited "exceeding 1,600 tokens/second" claim circulates too, but it traces to secondary blogs rather than Groq or an independent lab, so read it as reported, not confirmed.

The honest summary: the order of magnitude is real. Groq is consistently several times faster on per-user token throughput than mainstream GPU inference in independent tests. The exact top-line number depends heavily on which model, which test, and who's running it — and the flashiest figures are the least independently verified.
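To put the reported figures in human terms, the sketch below turns tokens per second into waiting time. The throughput numbers are the reported ones quoted above (roughly 800 tokens/second for Llama-3 70B on Groq, 50–150 for typical GPU-based providers). The 500-token answer length is a hypothetical chosen for illustration, and the calculation ignores time to first token, network latency and queueing.

```python
GROQ_LLAMA3_70B_TPS = 800            # reported figure quoted in the text
GPU_PROVIDER_TPS_RANGE = (50, 150)   # reported typical range quoted in the text
ANSWER_TOKENS = 500                  # hypothetical answer length, for illustration


def seconds_to_generate(num_tokens, tokens_per_second):
    """Pure streaming time; ignores time to first token and network latency."""
    return num_tokens / tokens_per_second


def speedup(fast_tps, slow_tps):
    """How many times faster the first throughput is than the second."""
    return fast_tps / slow_tps


groq_seconds = seconds_to_generate(ANSWER_TOKENS, GROQ_LLAMA3_70B_TPS)
gpu_seconds = tuple(
    seconds_to_generate(ANSWER_TOKENS, tps) for tps in GPU_PROVIDER_TPS_RANGE
)
speedups = tuple(
    speedup(GROQ_LLAMA3_70B_TPS, tps) for tps in GPU_PROVIDER_TPS_RANGE
)

print(f"Groq: {groq_seconds:.2f} s for {ANSWER_TOKENS} tokens")
print(f"GPU providers: {gpu_seconds[1]:.1f} s to {gpu_seconds[0]:.1f} s")
print(f"Speedup: {speedups[1]:.1f}x to {speedups[0]:.1f}x")
```

On these inputs a 500-token answer streams in well under a second on Groq, against several seconds at GPU-provider speeds. The gap is a matter of seconds, but it is the difference between watching text crawl and seeing it land all at once.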

The plot twist: Nvidia, then a re-staff

The technical story is only half of it now. In December 2025, Nvidia signed what it framed as a non-exclusive licensing agreement for Groq's technology — and hired away founder/CEO Jonathan Ross, president Sunny Madra and other staff — in a deal reported at roughly $20 billion, described as a "not-acqui-hire" (TechCrunch).

Groq survived as an independent company but had to rebuild. Doug Wightman, the co-founder who stayed, became CEO; new executives came in from xAI, Meta and Nuvalence (TechCrunch). On June 22, 2026, Groq confirmed a $650 million raise led by Disruptive and the Infinitum hedge fund, pivoting hard toward its inference cloud — 13 data centers and, by the company's count, "over five million developers" (Groq newsroom). Groq didn't disclose a new valuation; it was last marked at $6.9 billion after a $750M round in September 2025 (TechCrunch).

The record

The instant-LLM demos are real, and the architecture behind them (deterministic, compiler-scheduled, SRAM-first) is a real divergence from the GPU world, not a paint job. The biggest speed and utilization numbers, though, are mostly Groq's own or come from secondary blogs; the independently tested gains are large but less extreme than the viral figures. And the company that made the chips that started the hype has licensed its core tech to Nvidia and shipped its founder along with it. The speed sounded kooky-fast and turned out to be real. The chipmaker's future is now a cloud business, and that part is still being written.

Image: Groq logo, Wikimedia Commons, CC BY-SA 4.0.