I recently saw an interesting LinkedIn post from French startup ZML demonstrating vendor-agnostic Llama 2 inference. The demo leveraged pipeline parallelism to distribute the inference across hardware from Nvidia, AMD, and Google. Each chip was housed in a separate location, and the inference results were streamed back to the Mac that kicked off the job. Check out the video on the original post.
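If you're less familiar with the term, here's a minimal Python sketch of what pipeline parallelism means in this context. It's purely illustrative, not ZML's code: the `Stage` class, `run_pipeline` function, and toy layers are hypothetical names of my own, and the real demo streams activations between remote machines rather than calling local functions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    """One pipeline stage: a contiguous slice of the model's layers,
    conceptually hosted on one accelerator (Nvidia, AMD, or Google)."""
    name: str
    layers: List[Callable]

def run_pipeline(stages: List[Stage], activations):
    """Forward pass: each stage consumes the previous stage's output.
    In the demo, this hand-off happens over the network between machines."""
    for stage in stages:
        for layer in stage.layers:
            activations = layer(activations)
    return activations

if __name__ == "__main__":
    # Toy stand-ins for transformer blocks: each "layer" just adds 1.
    stages = [
        Stage("nvidia-gpu", [lambda x: x + 1, lambda x: x + 1]),
        Stage("amd-gpu",    [lambda x: x + 1]),
        Stage("google-tpu", [lambda x: x + 1]),
    ]
    print(run_pipeline(stages, 0))  # -> 4
```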
The standout feature of this demo is its hardware agnosticism, allowing a single codebase to run seamlessly across three distinct hardware platforms. I saw mention of “platforms” named zml/cuda, zml/rocm, and zml/tpu, hinting that this software may run on any CUDA hardware (H100, A100, etc.) and any ROCm hardware like the MI300X.
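As a rough mental model (my own assumption, not ZML's API), you can think of those platform strings as keys that map a single model definition onto a vendor-specific backend:

```python
# Hypothetical sketch: the platform strings come from the post, but this
# registry and resolve_platform are invented purely for illustration.
PLATFORM_TARGETS = {
    "zml/cuda": "Nvidia CUDA GPUs (H100, A100, ...)",
    "zml/rocm": "AMD ROCm GPUs (MI300X, ...)",
    "zml/tpu":  "Google TPUs",
}

def resolve_platform(name: str) -> str:
    """Look up which hardware family a platform string targets."""
    try:
        return PLATFORM_TARGETS[name]
    except KeyError:
        raise ValueError(f"unknown platform: {name}")

print(resolve_platform("zml/rocm"))  # AMD ROCm GPUs (MI300X, ...)
```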
The point of the demo is not latency or throughput but rather the distribution of a deep neural network across separate hardware instances from different vendors. Yet despite using consumer hardware and contending with network latency, the responsiv…