Next Generation GPU Enablement (Blackwell, MI350)
High-performance Custom GPU Kernels
Advanced Algorithms for Large Language Models
Compiler & Ecosystem Advancements (Triton, GEMM Tuning)
Our work has been recognized with over $100 million in annual infrastructure savings, top-tier publications, and open-source contributions.
- Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions, arXiv, Aug 11, 2025
- SpinQuant: LLM quantization with learned rotations, ICLR, 2025
- HadaCore: Tensor Core Accelerated Hadamard Transform Kernel, PyTorch Blog, Dec 12, 2024
- Context Parallelism for Scalable Million-Token Inference, MLSys 2025, Nov 10, 2024
- Enhancing Performance and Scalability of Large-Scale Recommendation Systems with Jagged Flash Attention, RecSys 2024, Sep 19, 2024
- The Llama 3 Herd of Models, July 31, 2024
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, NeurIPS 2024 Spotlight Poster, July 12, 2024
- INT4 Decoding GQA CUDA Optimizations for LLM Inference, PyTorch Blog, June 6, 2024
- Flash-Decoding for long-context inference, Stanford Blog, Oct 13, 2023
- Accelerated Generative Diffusion Models with PyTorch 2, PyTorch Blog, April 14, 2023
- Faster, more flexible inference on GPUs using AITemplate, a revolutionary new inference engine, Meta Research Blog, Oct 3, 2022
We have contributed to open-source projects, including but not limited to: