Llama-3.1 Announcement
We are happy to announce that we have brought up support for Llama-3.1-70B inference on Tenstorrent’s 8-chip systems, the TT-QuietBox and the TT-LoudBox.
The source code for Llama-3.1-70B and our other supported models is available on our GitHub. We have also merged support for Llama-3.1-8B, which runs on our single-chip n150 card.
Implementation highlights:
- Fractured with 8-way tensor parallelism
- Uses FlashAttention and FlashDecode
- Uses mixed BF16, BFP8, and BFP4 precision
- Performance was measured in eager mode with tracing disabled
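To give a feel for what 8-way tensor parallelism means, here is a minimal sketch in plain Python. It is not taken from our model code: the function names, the toy matrix sizes, and the choice of a column-wise split are all assumptions made for illustration. Each of the 8 shards stands in for the slice of a weight matrix that one chip would hold, and the final concatenation stands in for the gather step across chips.

```python
# Illustrative sketch only: hypothetical names and toy sizes,
# not the actual Tenstorrent implementation.
NUM_DEVICES = 8  # one shard per chip in an 8-chip system


def matvec(weights, x):
    """Compute y[j] = sum_i x[i] * weights[i][j] for a row-major matrix."""
    cols = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(cols)]


def shard_columns(weights, num_devices):
    """Split the weight matrix column-wise into equal shards, one per device."""
    cols = len(weights[0])
    assert cols % num_devices == 0, "columns must divide evenly across devices"
    width = cols // num_devices
    return [
        [row[d * width:(d + 1) * width] for row in weights]
        for d in range(num_devices)
    ]


def tensor_parallel_matvec(weights, x, num_devices=NUM_DEVICES):
    """Run the same input against every shard, then concatenate the results."""
    shards = shard_columns(weights, num_devices)
    # On real hardware each partial product would be computed on its own chip.
    partials = [matvec(shard, x) for shard in shards]
    # Stand-in for the gather step that reassembles the full output vector.
    return [value for part in partials for value in part]


# Toy 4x16 integer weight matrix and a 4-element input vector.
weights = [[(i * 16 + j) % 7 for j in range(16)] for i in range(4)]
x = [1, 2, 3, 4]

sharded_result = tensor_parallel_matvec(weights, x)
reference_result = matvec(weights, x)
```

The sharded result matches the single-device result exactly, while each device only ever holds one eighth of the weight columns.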
We are working on optimizations that will get us to our target of 20 tokens/second/user. Buy one of our 8-chip systems (TT-QuietBox or TT-LoudBox) to try Llama-3.1-70B at home on Tenstorrent hardware!