
TT-Deploy

Tenstorrent hosted TT-Deploy on May 1 to launch the Tenstorrent Galaxy™ Blackhole, now deployed at scale to deliver industry-leading general-purpose AI performance.

May 4, 2026


Change is Constant

Jim Keller, CEO of Tenstorrent, opened TT-Deploy, our launch event in San Francisco, in front of an overflowing room of developers, customers, partners, investors, and analysts. The message was clear: today we are shipping third-party-validated, industry-leading benchmarks and products at scale, with customers and partners defining the future of AI.

  • Tenstorrent Galaxy™ Blackhole: Now in production and shipping in volume, already running superclusters of 36 Tenstorrent Galaxies networked together as a single computer.
  • TT-QuietBox™ 2: Now available for purchase, a smaller water-cooled developer workstation born from feedback that the original units ran so loud that customers had to turn them off after dinner.

Standing in front of a two-year history of AI models, Jim asked, "Does anybody even remember Llama 70B?" Change is constant in AI, and we built Tenstorrent for change. "We are not trying to be a narrow provider. We're not trying to be a point solution. We want to solve a lot of different problems."

That mission comes back to fundamentals: scale, general-purpose coverage across inference and training, and low cost designed in from the start rather than bolted on later. "You don't put four die in a big expensive package with a silicon interposer. That will never be cheap," said Jim Keller. And the performance is landing. With DRAM, SRAM, and networking integrated on the same chip, we run prefill and decode on the same hardware, hitting 350+ tokens per second per user on DeepSeek across 16 Galaxies (671B parameters, batch size 32, 4-second time to first token) and enabling real-time video generation. "AI is a beautiful thing."

Run Fast

Jasmina Vasiljevic, who leads our AI software teams, walked through our new capabilities layered on top of our existing software foundation. From the bottom up, our software includes:

  • TT-Metal solves the data-movement problem at the kernel level,
  • TT-NN launches ops across the supercomputer in a single line of Python, and
  • TT-Lang and TT-Forge handle the higher-level DSL and compilation.

Blackhole's Tensix cores can run any kind of math imaginable, with independent cores that progress at their own pace rather than brute-forcing the problem with FLOPs – which is exactly why MoE models like DeepSeek V4 are such a strong fit. "AI is all about data movement, which is pretty exciting. So we're pretty happy to lean into that."

Jasmina then showed how Blackhole scales: 32 chips form a Galaxy, four Galaxies form a quad, and quads connect into superclusters via cabled all-to-all topology, with idle quads repurposable as switches. Critically, there are no Ethernet switches anywhere in the design. "All the traffic goes through galaxies and is programmed by our fabric," which keeps cost low and lets decode, prefill, and video generation run anywhere in the supercluster.
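The scale-out described above is easy to sanity-check with back-of-envelope arithmetic, using the figures stated in this post (32 chips per Galaxy, 4 Galaxies per quad, and the 36-Galaxy superclusters mentioned earlier). This is purely illustrative math, not an official sizing guide:

```python
# Back-of-envelope scale-out math from the figures in this post:
# 32 chips per Galaxy, 4 Galaxies per quad, 36 Galaxies per supercluster.
CHIPS_PER_GALAXY = 32
GALAXIES_PER_QUAD = 4
GALAXIES_PER_SUPERCLUSTER = 36

quads = GALAXIES_PER_SUPERCLUSTER // GALAXIES_PER_QUAD
chips = GALAXIES_PER_SUPERCLUSTER * CHIPS_PER_GALAXY

print(f"{quads} quads, {chips:,} chips networked as a single computer")
# → 9 quads, 1,152 chips networked as a single computer
```

Every one of those chips is reachable without an Ethernet switch in the path, which is where the cost advantage in the next section comes from.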

There is theory and then there is practice with serving AI models. Nvidia technically has a 300 t/s/user version of DeepSeek, but no provider actually serves it because GPU economics collapse at that throughput. Tenstorrent extends the low-cost serving curve instead, with 350 t/s/user DeepSeek shipping in two weeks and a roadmap to 500 t/s/user still at $6 per million tokens. "We are committed to crushing everybody at everything," said Jasmina Vasiljevic.
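To make the economics above concrete, here is a small illustrative calculation of what one saturated user stream earns per hour at the quoted price point. The numbers come straight from the paragraph above (350 and 500 tokens/sec/user, $6 per million tokens); the framing as revenue-per-user-hour is our own illustration, not official pricing math:

```python
# Illustrative serving-economics arithmetic from the figures quoted above
# (350-500 tokens/sec/user, $6 per million tokens). Not official pricing.
def revenue_per_user_hour(tokens_per_sec: float, usd_per_million_tokens: float) -> float:
    """Dollars generated by one saturated user stream in one hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return tokens_per_hour / 1_000_000 * usd_per_million_tokens

print(revenue_per_user_hour(350, 6.0))  # today's shipping throughput → 7.56
print(revenue_per_user_hour(500, 6.0))  # roadmap throughput → 10.8
```

The point of the curve is that throughput can rise ~40% while the per-token price stays flat, which is what "extending the low-cost serving curve" means in practice.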

Prodia co-founders Mikhail Avady and Monty Anderson joined Jasmina on stage. Through our partnership, Prodia's world record was 10x'd on Tenstorrent hardware, making video generation faster than real time, now at 2.5 seconds. As Monty put it: "Tenstorrent has proven to be the perfect target, optimized for inference, not just scaling through brute-force GPU."

"Tenstorrent understood the assignment." Yes, we did. Try it yourself today at console.tenstorrent.com

Run Anything

Stan Sokorac took the stage to answer a simple question: what else can Tenstorrent run? "Spoiler alert, it's everything." He laid out the goal he set for his team, which we call generality internally: "Run any model on any Tenstorrent hardware." The field moves so fast that a new state-of-the-art model lands every week, on top of the 2.8 million models already on Hugging Face. Customers don't want hardware that supports a curated list of ten models.

To prove out generality, Stan's team built an agentic pipeline running on a fleet of TT-QuietBoxes and Tenstorrent Galaxies that continuously pulls random models from Hugging Face, ports them, compiles them, and tests for accuracy. We watched for two things: real diversity across model types, and overall pass rate. After thousands of models, the rate has held steady at 90% — roughly 2.5 million Hugging Face models running on Tenstorrent. "That's not just achieving generality, that's crushing it."
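The 2.5-million figure above follows directly from the stated pass rate; a one-line sanity check (illustrative only, using the numbers quoted in this post):

```python
# Sanity-check of the generality claim: a 90% pass rate across the
# ~2.8 million models on Hugging Face implies roughly 2.5 million runnable models.
HF_MODELS = 2_800_000
PASS_RATE = 0.90

runnable = int(HF_MODELS * PASS_RATE)
print(f"~{runnable:,} models")  # → ~2,520,000 models
```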

He then walked through the software stack that makes it possible: TT-NN's op library, with thousands of kernels for every kind of math, and TT-Lang, our newest addition, a Python-based DSL built on progressive disclosure, easy by default but with full hardware control when needed, and designed from day one with AI codegen in the loop. TT-Lang can take kernels written for other hardware like CUDA and convert them in seconds: "If there ever was a software moat in the AI industry, with AI codegen and tools like TT-Lang, it's been completely demolished." Combined with our TT-Forge compiler, we have everything needed to run anything. And, as Stan called out, most importantly, "this is the only high-performance AI software stack out there that's 100% open source."

Run Anywhere

Two customer and partner panels rounded out the event, hosted by Amr Elashmawi, VP of Strategy and Business Development, focused on power efficiency, flexibility, openness, and day-zero model support. On the Deploy in Depth panel, Justen Aguillon from Equinix highlighted the efficiency gains driving our partnership: "You have an average two to three times lower power draw with Tenstorrent deployments, which allow customers to benefit from cost efficiencies." Abhishek Bhargava of BetterBrain pointed to architectural flexibility as the differentiator: "This is one of the reasons why we love working with Tenstorrent...because their chips enable such a huge diversity of models. We can tackle any use case on top of Tenstorrent."

On the Deploy at Scale panel, Alex Nataros from Cirrascale emphasized the TCO benefits of our switch-free architecture: "By simplifying and removing the network switches from this, it's just a lot less waste that goes into a deployment, which means we get a lot better cost efficiency to serve these models." Sanchayan Sinha from Turiyam described winning enterprise customers away from hyperscalers on cost: "We've won against them at this point in time because of the flexibility we have in the architecture." And Mike Gorbinski of Virtu Financial credited our adaptability for getting day-zero models like GLM-5.1 into production within weeks: "Because Tenstorrent is so adaptable, we managed to basically get things like GLM-5.1 off the ground quickly."

Where AI Runs

Jim closed the event with a love letter to DeepSeek v4. When the new model dropped, our team came back to Jim and said, "we think they made this one for us." Our software and hardware are general purpose. With the right combination of DRAM, SRAM, networking, and compute, we bring up new models fast. "The human brain is built from roughly a million cortical columns, each running at a few teraflops. About the same as one of our tensor processors," said Jim Keller. And we can scale from a tensor processor, to an array of processors, to an array of chips in a Tenstorrent Galaxy, to an array of Tenstorrent Galaxies in a supercluster, and from there, unlimited scaling. Jim shared our benchmark on Artificial Analysis's site, far to the right of every GPU competitor on throughput, and announced that more were coming. Artificial Analysis measures actual served models in real use cases, not synthetic standalone numbers. Tenstorrent works for real AI workloads, not benchmark theatre.

Tenstorrent is where AI runs.