Cerebras and the Wafer-Scale Engine: The Record on the Biggest Chip Ever Made
A developer waits. The cursor blinks. The model is "thinking." For most of the AI era, that pause has been the texture of the work — you ask, you wait, you read. Cerebras built a chip the size of a dinner plate to kill the pause. Here is what the chip actually is, what it actually does, and where the speed claims are verified versus where they are marketing.
The thing a person feels first: speed
If you have used a chatbot, you know the wait. The promise Cerebras keeps repeating is that the wait mostly disappears — text comes out faster than you can read it. That is the human pitch. Whether it holds up is a measurement question, and the good news is that an independent firm has been measuring. We will get there. First, the chip.
What the WSE actually is
Almost every other AI chip is cut out of a silicon wafer — one wafer yields dozens of small chips that get diced apart, packaged, and wired back together on circuit boards. Cerebras does the opposite: it keeps the whole wafer as one single chip.
Per Cerebras, the current generation, the Wafer-Scale Engine 3 (WSE-3), is built on a 5nm process and packs 4 trillion transistors and 900,000 AI-optimized cores onto a square of silicon roughly 21.5 centimeters per side, or about 46,225 mm². Cerebras and reporters describe it as the largest chip ever built, roughly 57x the area of a flagship GPU die (Cerebras press release; IEEE Spectrum; TweakTown).
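The size figures can be sanity-checked with simple arithmetic. The sketch below squares the quoted side length and then works backward from the 57x claim to the GPU die area it implies. The function names are illustrative, and the implied die area is derived here, not stated in the sources, which do not name the GPU used for the comparison.

```python
# Figures quoted in the text (per Cerebras and press coverage).
SIDE_CM = 21.5           # approximate side length of the WSE-3
AREA_RATIO_VS_GPU = 57   # claimed area multiple over a flagship GPU die


def square_area_mm2(side_cm: float) -> float:
    """Area of a square, in mm^2, given its side length in centimeters."""
    side_mm = side_cm * 10
    return side_mm * side_mm


def implied_reference_die_mm2(big_area_mm2: float, ratio: float) -> float:
    """Die area the comparison must assume if the ratio is taken at face value."""
    return big_area_mm2 / ratio


wse3_area = square_area_mm2(SIDE_CM)
gpu_area = implied_reference_die_mm2(wse3_area, AREA_RATIO_VS_GPU)

print(f"WSE-3 area: {wse3_area:,.0f} mm^2")               # 46,225 mm^2
print(f"Implied GPU die: about {gpu_area:,.0f} mm^2")     # about 811 mm^2
```

The side length and the area agree with each other, and the 57x claim implies a comparison die of roughly 800 mm², which is in the range of a very large single GPU die.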
The other unusual part is memory. Cerebras states the WSE-3 carries 44 GB of SRAM directly on the chip, with claimed on-chip memory bandwidth of 21 petabytes per second (Cerebras). The full system, the CS-3, fits the wafer into a 15U rack box with proprietary water cooling. Per Cerebras, it draws around 23 kW; its peak is 125 petaflops of AI compute (IEEE Spectrum).
Why the design claims to beat GPUs at inference
The argument is about where the data lives. On a GPU, the model's weights sit in separate high-bandwidth memory and have to be streamed to the compute cores for every token generated. When you are generating text one token at a time (the slow part of running a large language model), you are constantly hauling billions of parameters across that gap. Memory bandwidth, not raw math, becomes the bottleneck.
Cerebras's pitch is that by putting the cores and the memory on the same piece of silicon, the weights never leave the chip, so the wall that throttles GPUs largely goes away. That is the theory of why one giant chip can spit out tokens faster than racks of GPUs. It is a coherent engineering argument, but it is also Cerebras's own framing (Cerebras).
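A back-of-envelope version of that argument, as a sketch under stated assumptions: if every generated token requires reading all of the model's weights once, single-stream speed cannot exceed memory bandwidth divided by model size in bytes. The 21 PB/s figure is Cerebras's own claim quoted above. The 16-bit weights and the 3.35 TB/s GPU memory bandwidth are illustrative assumptions, not figures from the sources, and the function name is hypothetical.

```python
def decode_ceiling_tokens_per_s(params: float,
                                bytes_per_param: float,
                                bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream tokens/second when each token
    requires streaming every weight from memory exactly once."""
    weight_bytes = params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


PARAMS_70B = 70e9             # a 70-billion-parameter model
BYTES_PER_PARAM = 2           # ASSUMPTION: 16-bit weights
ASSUMED_GPU_HBM_BW = 3.35e12  # ASSUMPTION: illustrative single-GPU HBM bandwidth, bytes/s
WSE3_SRAM_BW = 21e15          # Cerebras's claimed on-chip bandwidth, bytes/s

gpu_ceiling = decode_ceiling_tokens_per_s(PARAMS_70B, BYTES_PER_PARAM, ASSUMED_GPU_HBM_BW)
wse_ceiling = decode_ceiling_tokens_per_s(PARAMS_70B, BYTES_PER_PARAM, WSE3_SRAM_BW)

print(f"GPU ceiling:     about {gpu_ceiling:,.0f} tokens/s")    # about 24
print(f"On-chip ceiling: about {wse_ceiling:,.0f} tokens/s")    # about 150,000
print(f"Bandwidth ratio: about {wse_ceiling / gpu_ceiling:,.0f}x")
```

These are ceilings, not predictions. A 70B model at 16 bits is about 140 GB, more than the 44 GB on one wafer, so it would have to be split across systems, and the model ignores batching, compute limits, and interconnect. The sketch shows the direction and rough scale of the bandwidth argument, not what either system delivers in practice.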
Where the speed is verified — and where it is marketing
This is the part that matters for the "kooky till proven" test. Cerebras makes loud claims, but a third party, the independent AI benchmarking firm Artificial Analysis, has measured the inference endpoints. Its numbers, not just Cerebras's, are what to anchor on:
- Llama 3.1 70B: Cerebras reports Artificial Analysis measured it crossing 2,100 tokens/second, a 3x jump over an earlier release (Cerebras blog).
- Llama 3.1 405B (the big one): Cerebras reports 969 output tokens/second, and claims this runs up to 75x faster than hyperscaler GPU offerings (Cerebras blog).
- Llama 4 Maverick: Cerebras says Artificial Analysis clocked its endpoint at 2,522 tokens/second, versus 1,038 tokens/second measured for an NVIDIA Blackwell system on the same model (Cerebras press release; HPCwire).
Verified: The raw output-speed numbers come from an independent benchmarker, which is a real check, not self-graded homework. On output tokens per second for these specific models, Cerebras has genuinely posted leading figures.
Marketing to discount: The eye-catching multipliers ("75x faster," "19x faster") are comparisons Cerebras chooses against particular GPU configurations and providers — apples-to-not-quite-apples. Speed is also only one axis; the benchmarks above measure output token rate, not cost per token, total throughput across many simultaneous users, or accuracy. Fast is verified. "Best value" or "best for every workload" is not the same claim.
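One way to see the gap between measured figures and chosen multipliers is to run the arithmetic on the numbers quoted above. This sketch computes the like-for-like ratio from the two Llama 4 Maverick measurements, then works backward from the "75x" and "3x" claims to the baselines they imply. The implied baselines are derived here by taking the multipliers at face value; they are not figures reported by the sources, and the helper names are illustrative.

```python
def speedup(rate: float, baseline: float) -> float:
    """How many times faster `rate` is than `baseline`."""
    return rate / baseline


def implied_baseline(rate: float, multiplier: float) -> float:
    """Baseline rate a 'Nx faster' claim assumes, taken at face value."""
    return rate / multiplier


# Llama 4 Maverick: both figures attributed to Artificial Analysis.
maverick_ratio = speedup(2522, 1038)

# Llama 3.1 405B: 969 tokens/s with an "up to 75x" claim.
implied_405b_baseline = implied_baseline(969, 75)

# Llama 3.1 70B: 2,100 tokens/s described as a 3x jump.
implied_70b_previous = implied_baseline(2100, 3)

print(f"Maverick, same-model comparison: {maverick_ratio:.2f}x")                  # 2.43x
print(f"Baseline implied by 75x: {implied_405b_baseline:.1f} tokens/s")           # 12.9
print(f"Earlier 70B rate implied by 3x: {implied_70b_previous:.0f} tokens/s")     # 700
```

On the one pair where both systems were measured on the same model, the lead works out to roughly 2.4x. The 75x figure only holds against a baseline of about 13 tokens per second, which is why the choice of comparison matters as much as the headline number.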
The business record (because the chip needs a company behind it)
Speed only matters if the company survives. The record here is eventful. Cerebras filed to go public in September 2024, then withdrew after a U.S. CFIUS review of the stake held by Abu Dhabi's G42; the review concluded in October 2025 after G42's stake was restructured into non-voting shares (CNBC). Cerebras refiled for an IPO in April 2026 (TechCrunch) and, according to first-day reporting, has since begun trading on Nasdaq (AGBI).
The S-1 numbers reported in the press are worth reading skeptically: $510 million in 2025 revenue, but with heavy customer concentration. Reporting indicates a large share of 2025 revenue traced to G42-linked and UAE-based entities (CNBC; mostlymetrics). A company whose revenue leans on a small set of related buyers carries a different risk than one with broad demand, no matter how fast the chip is. Cerebras also announced a large compute deal with OpenAI; treat the headline dollar figures as reported intentions until they show up as recognized revenue.
The honest summary
The WSE is real and genuinely strange: one wafer, one chip, 900,000 cores, memory on-die. The inference-speed lead on specific Llama models is verified by an independent benchmarker, which is more than most AI-hardware claims can say. What stays in the "prove it" column is the comparison math, the workloads outside raw output speed, and a business that — for now — depends on a narrow customer base. The pause is shorter. The full record is still being written.
Image: Cerebras logo, Wikimedia Commons (File:Cerebras_logo.svg), public domain under PD-textlogo (logo is trademarked). Direct file: https://upload.wikimedia.org/wikipedia/commons/1/15/Cerebras_logo.svg
Sources: Cerebras press release (WSE-3), Cerebras chip page, IEEE Spectrum, TweakTown, Cerebras 405B blog, Cerebras 3x-faster blog, Cerebras Maverick release, HPCwire, CNBC, TechCrunch, AGBI, mostlymetrics S-1 breakdown. Current as of June 2026; AI-hardware specs and financials move fast — verify before relying.