Custom Triton RMSNorm kernel
Inference
Measured results from the work, each linked to the post, paper, or code that reproduces it. Numbers first, claims second.
A hand-written Triton kernel takes a memory-bound op from PyTorch's 11% bandwidth utilization to 88%, an 8.1x speedup.
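As a rough sanity check on how the two utilization figures relate to the speedup, here is a minimal Python sketch. It is not the measurement code behind this result. The helper names (`rmsnorm_bytes_moved`, `bandwidth_utilization`, `implied_speedup`) are hypothetical, and it assumes both implementations are charged the same minimum traffic: one read of the input, one read of the weight vector, and one write of the output.

```python
def rmsnorm_bytes_moved(rows: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Minimum memory traffic for one RMSNorm pass (assumed model).

    Reads the input once, reads the weight vector once, writes the output once.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    activations = rows * hidden * bytes_per_elem
    weight = hidden * bytes_per_elem
    return 2 * activations + weight


def bandwidth_utilization(bytes_moved: int, seconds: float,
                          peak_bytes_per_s: float) -> float:
    """Fraction of peak memory bandwidth achieved by a kernel run."""
    return bytes_moved / seconds / peak_bytes_per_s


def implied_speedup(util_before: float, util_after: float) -> float:
    """For a fixed byte count, runtime scales as 1 / utilization."""
    return util_after / util_before


if __name__ == "__main__":
    # Uses the rounded percentages quoted above.
    print(f"{implied_speedup(0.11, 0.88):.1f}x")
```

With the rounded figures the ratio comes out at about 8x, which is consistent with the reported 8.1x once rounding of the two percentages is taken into account.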
Linear probes on hidden states detect deliberate underperformance (sandbagging) across Mistral, Gemma, and Qwen before any output is generated. Steering cut Gemma sandbagging by 20%.
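For readers unfamiliar with the technique, a linear probe is a linear classifier trained on fixed feature vectors. The sketch below is a standard-library-only illustration that uses logistic regression on synthetic vectors standing in for hidden states. It is not the probe code from the linked work. The data, dimensions, hyperparameters, and function names are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit weights w and bias b by full-batch gradient descent on log loss."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x) -> int:
    """Return 1 when the probe's logit is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def make_synthetic_states(n, dim, rng):
    """Synthetic stand-in for hidden states (assumption, not real activations).

    Label 1 shifts the first coordinate by +2, label 0 by -2.
    """
    feats, labels = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 if y else -2.0
        feats.append(x)
        labels.append(y)
    return feats, labels


rng = random.Random(0)
train_x, train_y = make_synthetic_states(200, 8, rng)
test_x, test_y = make_synthetic_states(100, 8, rng)
w, b = train_probe(train_x, train_y)
accuracy = sum(predict(w, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
```

In a real setup the feature vectors would be activations captured from a chosen layer during the forward pass over the prompt, which is what allows detection before any output token is produced.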
If a number here isn't reproducible from a linked repo or paper, it doesn't belong on this page.