Enable zero-copy for QNN GPU #2105

Open
qti-mattsinc wants to merge 8 commits into microsoft:main from CodeLinaro:dev/mattsinc/gpu-zero-copy

Conversation

qti-mattsinc commented Apr 27, 2026

  • Use the new GPU shared memory allocator in QNN EP to
    allocate the KV cache on CPU-accessible GPU memory.
    This provides a large speedup by eliminating unnecessary
    copy overhead.
  • Refactor QNN-specific checks out of the common
    EnsureDeviceOrtInit by adding GetMemoryInfo and
    GetProviderOptionsForAllocatorSession to DeviceInterface.
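
A minimal sketch of the refactor described in the second bullet, under stated assumptions: only the names `DeviceInterface`, `GetMemoryInfo`, and `GetProviderOptionsForAllocatorSession` come from this PR. The simplified `MemoryInfo` and `ProviderOptions` types, the `DescribeAllocatorSession` helper, and every string value are hypothetical stand-ins, not the actual genai or ONNX Runtime definitions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for an OrtMemoryInfo-like descriptor.
struct MemoryInfo {
  std::string name;
};

// Hypothetical stand-in for execution-provider options.
using ProviderOptions = std::map<std::string, std::string>;

struct DeviceInterface {
  virtual ~DeviceInterface() = default;

  // Defaults cover devices that need no special handling.
  virtual MemoryInfo GetMemoryInfo() const { return {"ExampleDefault"}; }
  virtual ProviderOptions GetProviderOptionsForAllocatorSession() const { return {}; }
};

// A device with nothing to override.
struct CpuLikeInterface : DeviceInterface {};

// A QNN-like device keeps its backend-specific choices in its own overrides.
struct QnnLikeInterface : DeviceInterface {
  explicit QnnLikeInterface(bool gpu_backend) : gpu_backend_(gpu_backend) {}

  MemoryInfo GetMemoryInfo() const override {
    // Placeholder names; the real memory-info names are not shown in this PR page.
    return {gpu_backend_ ? "ExampleQnnGpuShared" : "ExampleQnnNpuShared"};
  }

  ProviderOptions GetProviderOptionsForAllocatorSession() const override {
    // Placeholder option key and values.
    return {{"example_backend", gpu_backend_ ? "gpu" : "npu"}};
  }

 private:
  bool gpu_backend_;
};

struct AllocatorSessionConfig {
  std::string memory_name;
  ProviderOptions options;
};

// The common path asks the device instead of branching on the device type.
AllocatorSessionConfig DescribeAllocatorSession(const DeviceInterface& device) {
  MemoryInfo info = device.GetMemoryInfo();
  assert(!info.name.empty() && "a device must report a memory name");
  return {info.name, device.GetProviderOptionsForAllocatorSession()};
}
```

With this shape, adding a new device means adding overrides in that device's interface file rather than another branch in the common initialization code.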

Comment thread src/models/onnxruntime_api.h
Comment thread src/models/model.cpp Outdated
johnpaultaken left a comment

This change does not look ideal to me; we need to discuss why it deviates from the norm.
Ideally, I would like to see no change to genai at all.
When OrtMemoryInfo is of type QnnShared, can we just return the QnnGpuAllocator based on the device selection made by the user, i.e. GPU?
Everything else should work transparently, just as it does for the other EPs (CUDA, OpenVINO, etc.).
I don't see a need for the enable_dx12_shared_memory_allocator option; why not always enable it in the EP?
Also, we cannot have variable names like use_dx12_shared_memory that then turn out to be a QNN-specific option. Ideally, I would want to avoid any EP-specific code in genai.

qti-mattsinc force-pushed the dev/mattsinc/gpu-zero-copy branch from 6ea280c to 922e278 May 12, 2026 21:22
vjatoth-qti and others added 2 commits May 15, 2026 15:08
* Use the new GPU shared memory allocator in QNN EP to
  allocate the KV cache on CPU-accessible GPU memory.
  This provides a large speedup by eliminating unnecessary
  copy overhead.
* Keep fallback path with CPU<->GPU copies if outdated QNN
  EP or driver dependencies prevent the shared allocator from
  being available.

> Co-authored-by: qti-mattsinc <mattsinc@qti.qualcomm.com>
* Refactor QNN-specific checks out of the common
  `EnsureDeviceOrtInit` by adding `GetMemoryInfo` and
  `GetProviderOptionsForAllocatorSession` to `DeviceInterface`.
* Also move a WebGPU branch out of the common code along
  similar lines.
qti-mattsinc force-pushed the dev/mattsinc/gpu-zero-copy branch from 922e278 to cf07c29 May 15, 2026 22:09
qti-mattsinc changed the title from "WIP: Enable zero-copy for QNN GPU" to "Enable zero-copy for QNN GPU" May 18, 2026
qti-mattsinc marked this pull request as ready for review May 18, 2026 23:22
qti-mattsinc requested a review from a team as a code owner May 18, 2026 23:22
Copilot AI review requested due to automatic review settings May 18, 2026 23:22
Contributor

Copilot AI left a comment

Pull request overview

Enables zero-copy KV-cache placement for the QNN GPU backend by using the shared-memory allocator exposed by recent QNN EP packages, with a runtime capability probe and fallback. Also refactors device-specific allocator/session-option logic out of the common EnsureDeviceOrtInit path by adding two new virtual methods (GetMemoryInfo, GetProviderOptionsForAllocatorSession) on DeviceInterface, with QNN and WebGPU providing overrides.

Changes:

  • Add DeviceInterface::GetMemoryInfo / GetProviderOptionsForAllocatorSession plus a DeviceTypeToString helper, and replace ad-hoc per-device branches in EnsureDeviceOrtInit with calls to these virtuals.
  • Add IsQNNGPUBackend and an IsQNNGPUSharedAllocatorAvailable runtime probe (an embedded trivial ONNX model) to opt the GPU backend into shared-memory allocation when supported, with a warning and fallback otherwise.
  • Move WebGPU's "WebGPU_Buf" vs "WebGPU_Buffer" name fallback into the WebGPU GetMemoryInfo override.
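
The probe-and-fallback behaviour in the second bullet can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the real probe reportedly runs an embedded trivial ONNX model, whereas here it is an arbitrary callable, and treating a thrown exception as "unavailable" is a guess about how failure is signalled. `KvCachePlacement`, `IsSharedAllocatorAvailable`, and `ChooseKvCachePlacement` are hypothetical names, and the sketch covers only the GPU backend, not the explicitly enabled NPU case.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Where the KV cache ends up (hypothetical enum for illustration).
enum class KvCachePlacement {
  SharedGpuMemory,  // zero-copy: CPU-accessible GPU memory
  CpuWithCopies     // fallback: explicit CPU<->GPU copies
};

// Runs the capability probe once; any exception means "not available".
bool IsSharedAllocatorAvailable(const std::function<void()>& probe) {
  assert(probe && "probe must be callable");
  try {
    probe();
    return true;
  } catch (const std::exception& e) {
    // Warn and let the caller fall back instead of failing hard.
    std::cerr << "warning: shared allocator unavailable (" << e.what()
              << "); falling back to CPU<->GPU copies\n";
    return false;
  }
}

// Opt in to shared memory only for the GPU backend and only if the probe passes.
KvCachePlacement ChooseKvCachePlacement(bool is_gpu_backend,
                                        const std::function<void()>& probe) {
  if (is_gpu_backend && IsSharedAllocatorAvailable(probe)) {
    return KvCachePlacement::SharedGpuMemory;
  }
  return KvCachePlacement::CpuWithCopies;
}
```

The short-circuit in `ChooseKvCachePlacement` means the probe is never run for non-GPU backends, so an outdated EP or driver only costs a warning on the path that would have used the shared allocator.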

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/smartptrs.h | Adds DeviceTypeToString and two new virtual methods on DeviceInterface with default implementations driven by per-device name tables. |
| src/models/model.cpp | Simplifies EnsureDeviceOrtInit to use the new virtuals; drops inline QNN/WebGPU special cases and uses DeviceTypeToString in error messages. |
| src/qnn/interface.h | Forward declarations plus IsQNNGPUBackend declaration. |
| src/qnn/interface.cpp | QNN overrides of the new virtuals (with GPU vs NPU device filtering) and implementation of IsQNNGPUBackend. |
| src/qnn/session_options.cpp | Adds shared-memory allocator capability probe via a hardcoded trivial ONNX model; selects QNN device only when GPU shared allocator is available or NPU shared allocator is explicitly enabled. |
| src/webgpu/interface.cpp | Adds WebGPU GetMemoryInfo with attempted fallback between "WebGPU_Buf" and "WebGPU_Buffer". |
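
The name fallback noted for src/webgpu/interface.cpp could look something like the sketch below. Assumptions: `ResolveMemoryName` and the `is_known` predicate are hypothetical, standing in for whatever call the real override uses to check whether the runtime recognises a memory name, and the order in which the two names are tried is not stated on this page.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the first candidate name the runtime accepts, in the order given.
// Throws std::runtime_error if none of the candidates is recognised.
std::string ResolveMemoryName(
    const std::vector<std::string>& candidates,
    const std::function<bool(const std::string&)>& is_known) {
  assert(is_known && "predicate must be callable");
  for (const std::string& name : candidates) {
    if (is_known(name)) {
      return name;  // first match wins
    }
  }
  throw std::runtime_error("no supported memory name among candidates");
}
```

A caller would pass something like `{"WebGPU_Buf", "WebGPU_Buffer"}` so that the same binary works against runtime versions that use either spelling.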

Comment thread src/webgpu/interface.cpp
Comment thread src/qnn/interface.cpp Outdated
Comment thread src/qnn/session_options.cpp Outdated
qti-mattsinc requested a review from Copilot May 22, 2026 17:14
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Comment thread src/smartptrs.h
Comment thread src/smartptrs.h
Comment thread src/qnn/session_options.cpp
Comment thread src/qnn/session_options.cpp
Comment thread src/qnn/session_options.cpp
Comment thread src/webgpu/interface.cpp Outdated
Comment thread src/models/model.cpp Outdated
johnpaultaken left a comment

Looks cleaner than the existing code.
QNN-specific checks are now moved to the QNN-specific interface.
And so is the WebGPU-specific code.
