Enable zero-copy for QNN GPU #2105
This change does not look ideal to me; we need to discuss why it deviates from the norm.
Ideally, I would like to see no change to genai at all.
When OrtMemoryInfo is of type QnnShared, can we just return the QnnGpuAllocator based on the device selection made by the user, i.e. GPU?
Everything else should work transparently, just like how it works for the other EPs (CUDA, OpenVINO, etc.).
I don't see a need for the enable_dx12_shared_memory_allocator option; why not always enable it in the EP?
Also, we cannot have variable names like use_dx12_shared_memory that turn out to be a QNN-specific option. Ideally, I would want to avoid any code in genai that is EP-specific.
6ea280c to 922e278
* Use the new GPU shared memory allocator in QNN EP to allocate the KV cache on CPU-accessible GPU memory. This provides a large speedup by eliminating unnecessary copy overhead.
* Keep fallback path with CPU<->GPU copies if outdated QNN EP or driver dependencies prevent the shared allocator from being available.

Co-authored-by: qti-mattsinc <mattsinc@qti.qualcomm.com>
* Refactor QNN-specific checks out of the common `EnsureDeviceOrtInit` by adding `GetMemoryInfo` and `GetProviderOptionsForAllocatorSession` to `DeviceInterface`.
* Also move a WebGPU branch out of the common code along similar lines.
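For readers following the refactor, here is a minimal sketch of the pattern this commit describes: the common initialization path asks the device interface for its memory info and allocator-session options instead of branching on the device type. Only `DeviceInterface`, `GetMemoryInfo`, `GetProviderOptionsForAllocatorSession`, and `DeviceTypeToString` are names from this PR. The `MemoryInfoSketch` type, the allocator names, the `"backend"` option key, and `DescribeAllocatorSetup` are hypothetical placeholders, not the real genai or ONNX Runtime types.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for OrtMemoryInfo; only the allocator name is modeled.
struct MemoryInfoSketch {
  std::string allocator_name;
};

using ProviderOptions = std::map<std::string, std::string>;

enum class DeviceType { CPU, WebGPU, QNN };

// Name lookup used for defaults and error messages (strings are illustrative).
inline const char* DeviceTypeToString(DeviceType type) {
  switch (type) {
    case DeviceType::CPU: return "CPU";
    case DeviceType::WebGPU: return "WebGPU";
    case DeviceType::QNN: return "QNN";
  }
  return "Unknown";
}

struct DeviceInterface {
  virtual ~DeviceInterface() = default;
  virtual DeviceType GetType() const = 0;

  // Default: derive the allocator name from the device type.
  virtual MemoryInfoSketch GetMemoryInfo() const {
    return {DeviceTypeToString(GetType())};
  }
  // Default: the allocator session needs no extra provider options.
  virtual ProviderOptions GetProviderOptionsForAllocatorSession() const {
    return {};
  }
};

struct CpuInterface : DeviceInterface {
  DeviceType GetType() const override { return DeviceType::CPU; }
};

// The device-specific decisions live in the override, not in the common code.
struct QnnInterface : DeviceInterface {
  explicit QnnInterface(bool gpu_backend) : gpu_backend_{gpu_backend} {}
  DeviceType GetType() const override { return DeviceType::QNN; }
  MemoryInfoSketch GetMemoryInfo() const override {
    return {gpu_backend_ ? "QnnGpuShared" : "QnnNpuShared"};  // placeholder names
  }
  ProviderOptions GetProviderOptionsForAllocatorSession() const override {
    return {{"backend", gpu_backend_ ? "gpu" : "npu"}};  // placeholder key
  }

 private:
  bool gpu_backend_;
};

// Stand-in for the common path: no per-device branches remain here.
inline std::string DescribeAllocatorSetup(const DeviceInterface& device) {
  std::string out = device.GetMemoryInfo().allocator_name;
  for (const auto& kv : device.GetProviderOptionsForAllocatorSession())
    out += " " + kv.first + "=" + kv.second;
  return out;
}
```

With this shape, supporting another device means adding an override rather than another branch in the shared function.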
922e278 to cf07c29
Pull request overview
Enables zero-copy KV-cache placement for the QNN GPU backend by using the shared-memory allocator exposed by recent QNN EP packages, with a runtime capability probe and fallback. Also refactors device-specific allocator/session-option logic out of the common EnsureDeviceOrtInit path by adding two new virtual methods (GetMemoryInfo, GetProviderOptionsForAllocatorSession) on DeviceInterface, with QNN and WebGPU providing overrides.
Changes:
- Add `DeviceInterface::GetMemoryInfo`/`GetProviderOptionsForAllocatorSession` plus a `DeviceTypeToString` helper, and replace ad-hoc per-device branches in `EnsureDeviceOrtInit` with calls to these virtuals.
- Add `IsQNNGPUBackend` and an `IsQNNGPUSharedAllocatorAvailable` runtime probe (embedded trivial ONNX model) to opt the GPU backend into shared-memory allocation when supported, with a warning-and-fallback otherwise.
- Move WebGPU's "WebGPU_Buf" vs "WebGPU_Buffer" name fallback into the WebGPU `GetMemoryInfo` override.
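The probe-then-fallback behavior in the second item can be sketched roughly as follows. This is an illustration only: `KvCachePlacement` and `SelectKvCachePlacement` are hypothetical names, and the `probe` callable stands in for `IsQNNGPUSharedAllocatorAvailable`, whose real signature and probing logic (the embedded ONNX model) are not reproduced here.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

enum class KvCachePlacement { SharedGpuMemory, CpuWithCopies };

// `probe` is a placeholder for the runtime capability check. It may return
// false or throw when the QNN EP or driver is too old to expose the allocator.
inline KvCachePlacement SelectKvCachePlacement(const std::function<bool()>& probe) {
  try {
    if (probe()) return KvCachePlacement::SharedGpuMemory;
    std::cerr << "warning: shared allocator unavailable; using CPU<->GPU copies\n";
  } catch (const std::exception& e) {
    std::cerr << "warning: shared allocator probe failed (" << e.what()
              << "); using CPU<->GPU copies\n";
  }
  // Fallback path: keep working, but without the zero-copy speedup.
  return KvCachePlacement::CpuWithCopies;
}
```

Treating a failed or throwing probe as "not available" keeps older QNN EP packages working on the copy path instead of failing at startup.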
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/smartptrs.h | Adds DeviceTypeToString and two new virtual methods on DeviceInterface with default implementations driven by per-device name tables. |
| src/models/model.cpp | Simplifies EnsureDeviceOrtInit to use the new virtuals; drops inline QNN/WebGPU special cases and uses DeviceTypeToString in error messages. |
| src/qnn/interface.h | Forward declarations plus IsQNNGPUBackend declaration. |
| src/qnn/interface.cpp | QNN overrides of the new virtuals (with GPU vs NPU device filtering) and implementation of IsQNNGPUBackend. |
| src/qnn/session_options.cpp | Adds shared-memory allocator capability probe via a hardcoded trivial ONNX model; selects QNN device only when GPU shared allocator is available or NPU shared allocator is explicitly enabled. |
| src/webgpu/interface.cpp | Adds WebGPU GetMemoryInfo with attempted fallback between "WebGPU_Buf" and "WebGPU_Buffer". |
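The WebGPU name fallback in the last row amounts to trying a list of candidate allocator names and taking the first one the runtime recognizes. A minimal sketch, assuming a hypothetical `FirstRegisteredAllocatorName` helper and an `is_registered` callback in place of the real ONNX Runtime lookup; the order in which the two names are tried is also an assumption:

```cpp
#include <functional>
#include <string>
#include <vector>

// Returns the first candidate accepted by `is_registered`, or an empty string
// if none is. `is_registered` stands in for the real allocator lookup.
inline std::string FirstRegisteredAllocatorName(
    const std::vector<std::string>& candidates,
    const std::function<bool(const std::string&)>& is_registered) {
  for (const auto& name : candidates) {
    if (is_registered(name)) return name;
  }
  return std::string();
}
```

A caller would pass `{"WebGPU_Buf", "WebGPU_Buffer"}` and treat an empty result as an error.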
* EP implementation changed to use this instead of choosing allocator by selected backend
johnpaultaken left a comment
Looks cleaner than the existing code.
QNN-specific checks are now moved to the QNN-specific interface.
And so is the WebGPU-specific code.