Tokens Per Second Per Watt: A Useful Metric for Edge AI
A systems-level perspective on performance and energy efficiency
In recent years, the NPU has emerged as a power-efficient solution for AI workloads.
In our previous NPU discussions (1, 2), we covered the performance measurement du jour: TOPS. However, TOPS falls short because it doesn't account for power consumption, which directly affects battery life.
For small language models (SLMs) at the edge, performance efficiency takes precedence over raw performance. Performance is irrelevant if it drains the battery too quickly. Therefore, performance per watt, measured as TOPS/W, becomes the important metric: can the required TOPS fit within the power budget?
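A back-of-the-envelope sketch makes the TOPS/W question concrete. All of the numbers below are hypothetical, chosen only to illustrate the arithmetic, not drawn from any particular chip:

```python
# Does the compute an SLM needs fit inside the device's power budget?
# All values are illustrative assumptions.
required_tops = 10.0               # sustained TOPS the workload demands
chip_efficiency_tops_per_w = 5.0   # NPU efficiency (TOPS/W)
power_budget_w = 1.5               # what the battery/thermal envelope allows

power_needed_w = required_tops / chip_efficiency_tops_per_w
fits = power_needed_w <= power_budget_w
print(f"Needs {power_needed_w:.1f} W; budget is {power_budget_w} W; fits: {fits}")
```

In this made-up case the workload needs 2 W against a 1.5 W budget, so the chip's efficiency, not its peak TOPS, is what rules it in or out.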
Yet, TOPS/W doesn’t capture the user experience. The chip might be efficient, but is it delivering the inference speed users require?
For edge SLMs, responsiveness and speed matter most to the user. Responsiveness is measured by time to first token (TTFT), and speed by tokens per second (TPS). TTFT determines perceived snappiness, while TPS determines whether the answer arrives quickly enough for the user.
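As a rough sketch of how both metrics fall out of a single generation pass, the function below times any token stream. The `stream` argument is a placeholder for whatever iterator your runtime exposes; it is an assumption, not a specific library API:

```python
import time

def measure_ttft_and_tps(stream):
    """Measure time to first token (TTFT) and decode tokens/second (TPS)
    for any iterator that yields tokens as they are generated."""
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in stream:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now   # prefill done, first token emitted
        n_tokens += 1
    end = time.perf_counter()

    if first_token_time is None:
        raise ValueError("stream produced no tokens")

    ttft = first_token_time - start            # responsiveness, in seconds
    decode_time = end - first_token_time       # time spent on the remaining tokens
    tps = (n_tokens - 1) / decode_time if decode_time > 0 else float("inf")
    return ttft, tps
```

Point it at the streaming output of whichever runtime you use and you get both numbers in one pass; pairing the TPS figure with measured power draw is what yields tokens per second per watt.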
An edg…