Confidential Computing (CC) is great for AI since it lets you run workloads on someone else's hardware while keeping your proprietary data secure. Not even the cloud provider can access your data.
But it has an issue… Research has shown that confidential computing adds a significant performance overhead on GPUs.
This post walks through these overheads and tries to explain them in a simple manner.
How does Confidential Computing work on GPUs?
At a high level, confidential computing locks down the GPU by shielding GPU memory from outside access and ensuring that all data moving in or out of the GPU is encrypted and authenticated. It even protects GPU-to-GPU communication.
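To make that concrete, here is a minimal Python sketch of the "encrypted bounce buffer" pattern for CPU-GPU transfers under confidential computing. The AES-GCM cipher and the `cpu_to_gpu` helper are my own illustrative choices; the real work happens in the driver and hardware.

```python
# A minimal sketch of an encrypted CPU->GPU transfer (illustration only;
# real confidential computing does this in the driver and hardware).
# Requires: pip install cryptography
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # shared via attested key exchange
aead = AESGCM(key)

def cpu_to_gpu(plaintext: bytes) -> bytes:
    # 1. The CPU encrypts and authenticates the data into a "bounce buffer".
    nonce = os.urandom(12)
    bounce_buffer = nonce + aead.encrypt(nonce, plaintext, None)
    # 2. The encrypted blob crosses the untrusted PCIe bus.
    # 3. The GPU decrypts and verifies it inside its protected memory.
    nonce, ciphertext = bounce_buffer[:12], bounce_buffer[12:]
    return aead.decrypt(nonce, ciphertext, None)  # raises if tampered with

assert cpu_to_gpu(b"model weights") == b"model weights"
```

Every single transfer pays for that encrypt/decrypt round trip.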
This is great for security, but it fundamentally changes the way GPUs behave. GPUs are built to be fast at two things: doing a lot of math and moving data freely at high bandwidth.
The big issue is that confidential computing puts a toll on every one of those data movements, forcing GPUs to work in ways they weren't designed for.
The Two Different Patterns of GPU Overhead
1. Memory-Heavy AI Workloads
Most modern AI workloads (especially LLMs) don't just compute; they constantly move data.
Model weights are swapped in and out, attention (KV) caches bounce between CPU and GPU, and memory pressure forces frequent transfers.
But with confidential computing, every one of those swaps must be encrypted and decrypted. The CPU becomes the bottleneck, leaving the GPU idle while it waits for data.
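You can get a feel for the bottleneck by comparing how fast a CPU can encrypt a buffer with the bandwidth of the link it feeds. The 64 MiB shard size and the ~64 GB/s PCIe Gen5 x16 figure below are my own illustrative assumptions; the measured throughput will vary by machine.

```python
# A rough sketch of why the CPU becomes the bottleneck: every swapped chunk
# must pass through encryption before it can cross to the GPU.
# Requires: pip install cryptography
import os
import time
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

aead = AESGCM(AESGCM.generate_key(bit_length=256))
chunk = os.urandom(64 * 1024 * 1024)  # pretend this is a 64 MiB weight shard

start = time.perf_counter()
aead.encrypt(os.urandom(12), chunk, None)
elapsed = time.perf_counter() - start

print(f"CPU encryption throughput: ~{len(chunk) / elapsed / 1e9:.1f} GB/s")
print("PCIe Gen5 x16 bandwidth:   ~64 GB/s")
# Whenever encryption throughput < link bandwidth, the GPU sits idle
# waiting for data, and the gap grows with memory pressure.
```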
In one paper, the latency overhead peaked at 36.2%, 52.8%, and 88.2%, depending on model size and memory pressure.
https://arxiv.org/abs/2411.03357
And in another paper, confidential computing added around 54% to 903.9% (wow…) slowdown when running inference on a single GPU and around 10% to 455% overhead when training on a single GPU!
https://arxiv.org/abs/2410.15240
2. Training with Multiple GPUs
This is where things go crazy!
When training with multiple GPUs, the GPUs frequently exchange small updates with each other to stay in sync, and each exchange is broken up into many small messages.
With confidential computing, every single one of those messages must be encrypted and decrypted, which makes lots of small messages extremely expensive.
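A back-of-envelope model shows why. Each message pays a fixed crypto and bounce-buffer cost on top of the transfer itself, so splitting the same payload into more messages multiplies the overhead. The 10 µs fixed cost and 50 GB/s link below are made-up illustrative numbers, not measurements from the paper.

```python
# Toy cost model: total sync time = messages * (fixed crypto cost + transfer).
def sync_time_us(total_bytes, n_messages, fixed_cost_us=10.0, link_gbps=50.0):
    per_msg_bytes = total_bytes / n_messages
    transfer_us = per_msg_bytes / (link_gbps * 1e9) * 1e6
    return n_messages * (fixed_cost_us + transfer_us)

total = 100 * 1024 * 1024  # a 100 MiB gradient exchange
for n in (1, 64, 4096):
    print(f"{n:>5} messages -> {sync_time_us(total, n) / 1e3:7.1f} ms")
# The raw transfer time is identical in all three cases; the per-message
# fixed cost is what explodes, and it quickly dominates.
```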
Researchers measured an average overhead of 768% and a maximum overhead of 4,060% for training workloads running on multiple GPUs.
https://arxiv.org/abs/2501.11771
When Should You Use Confidential Computing?
Given the significant overheads of using confidential computing, when should we use it?
Confidential computing is a good fit when a workload is very computation-heavy and not very communication-heavy. The common thread across all the papers is that heavy communication between the GPU and CPU is what causes the big slowdowns under confidential computing. So, if a workload spends most of its time doing math on the GPU, the slowdown can become negligible.
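A quick Amdahl-style estimate makes this intuition concrete: only the communication fraction of the runtime slows down. The 10x crypto slowdown below is a made-up illustrative number, not a result from the papers.

```python
# If a fraction `comm` of runtime is data movement, and encryption makes
# that part `crypto_slowdown` times slower, the whole job slows down by:
def job_slowdown(comm: float, crypto_slowdown: float) -> float:
    return (1 - comm) + comm * crypto_slowdown

for comm in (0.05, 0.30, 0.80):
    print(f"{comm:.0%} communication -> {job_slowdown(comm, 10.0):.2f}x runtime")
# Compute-bound jobs (5% communication) barely notice; communication-bound
# jobs (80%) grind to a crawl.
```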
You could also use systems designed around confidential computing. One example is PipeLLM (https://arxiv.org/abs/2411.03357), which predicts which data the GPU will need next and encrypts it while the GPU is still computing, hiding much of the encryption cost behind the computation.
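Here is a toy sketch of that overlap idea, using threads and sleeps as stand-ins for real encryption and GPU work. This is my own illustration of the general pipelining trick, not PipeLLM's actual implementation.

```python
# Encrypt the *next* chunk on a background thread while the "GPU" works on
# the current one, so the crypto cost hides behind the compute.
import queue
import threading
import time

def encrypt(chunk):            # stand-in for real crypto, ~10 ms per chunk
    time.sleep(0.01)
    return chunk

def compute_on_gpu(chunk):     # stand-in for a GPU kernel, ~10 ms per chunk
    time.sleep(0.01)

def run_pipelined(chunks):
    q = queue.Queue(maxsize=2)  # small prefetch window
    def producer():
        for c in chunks:
            q.put(encrypt(c))   # runs ahead of the consumer
        q.put(None)
    threading.Thread(target=producer, daemon=True).start()
    while (c := q.get()) is not None:
        compute_on_gpu(c)       # overlaps with encrypting the next chunk

start = time.perf_counter()
run_pipelined(range(20))
print(f"pipelined: {time.perf_counter() - start:.2f}s "
      f"(doing the two steps serially would take ~0.40s)")
```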
Conclusion
Before reading these papers, I didn't really know the overhead of using confidential computing on GPUs. I knew there was some overhead, but I never thought it could be this crazy. That surprise is what prompted me to write this blog post.
Confidential computing is great for security, but because it fundamentally changes how GPUs operate, it adds a significant performance overhead. Before reaching for confidential computing as a default solution, it is important to know when it is a good fit.