
[WIP] backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs #12063

Draft: wants to merge 178 commits into master
Conversation

@chraac commented Feb 25, 2025

Warning: This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.

Summary

This fork is based on zhouwg's initial PR and performs further refactoring and improvements to introduce support for the Qualcomm QNN backend to GGML.

This backend is organized into three distinct integration layers:

graph TB
    subgraph GGML Adaptation Layer
        A1[Graph Caching, Mapping, and Execution]
        A2[Tensor Binding and Execution Flow]
    end

    subgraph QNN Object Layer
        B1[QNN System and Instance Management]
        B2[Dynamic Resource Handling]
    end

    subgraph Utility Layer
        C1[Dynamic Library Loading & Search Path Management]
        C2[General Utilities]
    end

    %% Relations to illustrate stack dependency
    A1 -->|Uses| B1
    A2 -->|Uses| B1
    B1 -->|Relies on| C1
  1. GGML Adaptation Layer

    • Graph Caching, Mapping, and Execution:

      • Provides a robust mechanism to map a GGML computation graph into a corresponding QNN graph, allowing efficient offloading of operations to the QNN accelerator.
      • Implements graph caching strategies (in backend-ops.cpp) to minimize redundant graph creation and boost execution performance.
      • Seamlessly translates GGML operations into corresponding QNN op objects using specialized op constructors and configuration functions (configured in op-config-caps.cpp and op-config-impl.cpp).
    • Tensor Binding and Execution Flow:

      • Adapts GGML tensor objects to the QNN backend (see tensor.hpp and graph.hpp), managing both host and RPC memory via buffer interfaces like qnn_buffer_interface.
      • Ensures proper data flow between GGML graphs and QNN execution contexts through carefully handled tensor binding/unbinding procedures.
  2. QNN Object Layer

    • QNN System and Instance Management:

      • Encapsulates the QNN system via the qnn_system_interface class, originally derived from executorch, to create and free the QNN system context.
      • Manages QNN instance creation and initialization via the qnn_instance class
      • Implements backend loading routines (e.g., load_backend() and load_system()) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
      • Uses caching mechanisms for loaded backends and tracks library handles to guarantee proper cleanup during finalization.
    • Dynamic Resource Handling:

      • Integrates fallback mechanisms in load_lib_with_fallback() to reliably load both the system and RPC libraries.
      • Manages RPC memory allocation and deallocation via function pointer resolution from the loaded RPC library.
  3. Utility Layer

    • Dynamic Library Loading & Search Path Management:

      • Implements functions in qnn-lib.cpp to manage dynamic library loading with fallbacks.
      • Uses helper routines such as insert_path() and set_qnn_lib_search_path() to configure environment variables (like LD_LIBRARY_PATH on Linux and ADSP_LIBRARY_PATH on Android) based on a custom library search path (a minimal sketch follows this list).
    • General Utilities:

      • Provides detailed error and debug logging through QNN logging macros.
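
To make the search-path handling concrete, here is a minimal POSIX-only sketch of what helpers like insert_path() / set_qnn_lib_search_path() do; the `_sketch` names are illustrative, and the real code in qnn-lib.cpp also handles Windows and further edge cases:

    #include <cstdlib>
    #include <string>

    // Prepend `path` to the colon-separated environment variable `env_name`.
    static void insert_path_sketch(const char * env_name, const std::string & path) {
        const char * old_value = std::getenv(env_name);
        std::string  new_value = path;
        if (old_value && *old_value) {
            new_value += ':';
            new_value += old_value;
        }
        setenv(env_name, new_value.c_str(), /*overwrite =*/ 1);
    }

    // Make the QNN (and, on Android, Hexagon DSP) libraries discoverable.
    static void set_qnn_lib_search_path_sketch(const std::string & custom_lib_dir) {
        insert_path_sketch("LD_LIBRARY_PATH", custom_lib_dir);
    #if defined(__ANDROID__)
        insert_path_sketch("ADSP_LIBRARY_PATH", custom_lib_dir);
    #endif
    }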

Key Features and Improvements

  • Graph Mapping Mechanism:

    • Efficient mapping of GGML graphs into QNN graphs is a standout feature, enabling the offloading and execution of computation graphs on hardware accelerators (see graph.hpp and backend-ops.cpp).
    • Graph caching strategies help reuse QNN graphs to reduce redundancy and enhance performance (a simplified sketch follows this list).
    • The translation of GGML operations into corresponding QNN ops supports various data types and parameter configurations.
  • Backend Context and Device Management:

    • Comprehensive QNN instance initialization supports API negotiation, enhanced error handling, and detailed device property logging.
    • Detailed logs (chipset description, HTP architecture, VTCM memory size) facilitate debugging and performance tuning.
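
To make the caching idea concrete, here is a simplified sketch; the types and helper below are illustrative, not the exact ones used in backend-ops.cpp:

    #include <memory>
    #include <string>
    #include <unordered_map>

    struct qnn_graph { /* wraps a composed and finalized QNN graph */ };

    using graph_cache_t = std::unordered_map<std::string, std::unique_ptr<qnn_graph>>;

    // Look up a previously built QNN graph by the key derived from the ggml_cgraph
    // structure; only compose and finalize a new QNN graph on a cache miss.
    static qnn_graph * get_or_build_graph(graph_cache_t & cache, const std::string & key) {
        auto it = cache.find(key);
        if (it != cache.end()) {
            return it->second.get();                   // reuse: skip graph composition entirely
        }
        auto graph = std::make_unique<qnn_graph>();    // build from the ggml_cgraph here
        qnn_graph * raw = graph.get();
        cache.emplace(key, std::move(graph));
        return raw;
    }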

Build

For build instructions please refer to this page

Testing

  • Basic functionality of the QNN backend has been verified on Android, Linux, and Windows platforms using test-backend-ops; this is integrated into the pipeline for each commit on the dev-refactoring branch.

    Platform | Full test-backend-ops console output
    Android  | test-backend-ops_all_android_ff033e1.log (screenshot omitted)
    Linux    | test-backend-ops_all_linux_ff033e1.log (screenshot omitted)
    Windows  | To be filled
  • Proper graph creation and execution paths are confirmed through detailed log messages.

  • Memory registration and cleanup within tensor binding functions have been thoroughly checked.

  • The table below shows recordings of the QNN backend running on different platforms:

    Platform | SoC     | Model                      | GIF                             | Original video
    Android  | 8 Gen 2 | llama-3-8B-Instruct-Q4_K_M | Recording_Muted_hevc_14_126_640 | Recording_Muted_hevc.mp4
    Windows  | To be filled

Current state

  • The test-backend-ops suite passes on all platforms, including support for both qnn-npu and qnn-gpu devices.
  • Testing with llama3.2-1b/3b-f16/32 models yields expected results.
  • Quantized matrix multiplication is under development; for quantized modules, the CPU backend may be used as a fallback.

Future development

  • Further feature support and device-specific optimizations are planned (see also the project backlog).
  • Future iterations will add support for quantization data types, with efforts underway to map GGML's block quantization structure into QNN (see the sketch below).
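
For context, ggml's block quantization packs groups of weights with a shared scale. The struct below mirrors ggml's q4_0 layout (32 4-bit weights plus one scale), while the dequantization loop is a simplified illustration of the interim convert-to-float path; the scale is shown as a plain float here, whereas ggml stores it as fp16:

    #include <cstddef>
    #include <cstdint>

    constexpr int QK4_0 = 32;                  // weights per block, as in ggml

    struct block_q4_0_sketch {
        float   d;                             // shared scale (fp16 in ggml, float here)
        uint8_t qs[QK4_0 / 2];                 // 32 x 4-bit quants, two per byte
    };

    // Expand quantized blocks to float so they can be fed to QNN as F32 tensors.
    static void dequantize_q4_0_sketch(const block_q4_0_sketch * blocks, float * out, size_t n_blocks) {
        for (size_t b = 0; b < n_blocks; ++b) {
            const float d = blocks[b].d;
            for (int i = 0; i < QK4_0 / 2; ++i) {
                const int q_lo = (blocks[b].qs[i] & 0x0F) - 8;   // low nibble  -> element i
                const int q_hi = (blocks[b].qs[i] >> 4)   - 8;   // high nibble -> element i + 16
                out[b * QK4_0 + i]             = q_lo * d;
                out[b * QK4_0 + i + QK4_0 / 2] = q_hi * d;
            }
        }
    }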

zhouwg and others added 30 commits April 24, 2024 16:28
@zhouwg

This comment was marked as off-topic.

@chraac (Author) commented Feb 25, 2025

> I don't know this Chinese programmer and I'm not a member of his team and I'd like to see his team's success in this great community. thanks.

Yeah, just to clarify, @zhouwg is not affiliated with us, but we appreciate his support! Anyone interested in discussing QNN-related topics is very welcome to join the conversation.

}

bool qnn_graph::build_graph_from_ggml_graph(const ggml_cgraph *cgraph) {
QNN_LOG_DEBUG("[%s][%s]build start", get_backend_name(_device), _graph_name.c_str());
@chraac (Author) commented:

Here's how we map a ggml_cgraph into a QNN graph:
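
For readers following along, a rough outline of what this mapping loop does; the numbered steps reflect the actual flow, but the function below is illustrative pseudostructure rather than the PR's code (it assumes access to the internal ggml_cgraph fields, as backend sources have):

    #include "ggml.h"   // plus the internal header that defines ggml_cgraph

    bool build_graph_from_ggml_graph_sketch(const ggml_cgraph * cgraph) {
        for (int i = 0; i < cgraph->n_nodes; ++i) {
            ggml_tensor * node = cgraph->nodes[i];
            if (node->op == GGML_OP_NONE || node->op == GGML_OP_VIEW) {
                continue;                      // no QNN node needed for no-ops / views
            }
            // 1. ensure QNN tensors exist for node->src[*] and for node itself
            // 2. look up the op constructor for node->op and append the matching QNN op
            //    (this is where op-config-caps.cpp / op-config-impl.cpp come in)
        }
        // 3. finalize the composed QNN graph once, so it can be executed repeatedly
        return true;
    }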

return reinterpret_cast<Fn>(dl_sym(handle, function_name));
}

} // namespace qnn
@chraac (Author) commented:

TODO: this dl_loader can be removed if upstream provides a unified dynamic loading mechanism
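
For context, a condensed sketch of what such a dl_loader typically wraps; this is simplified, and the actual loader in this branch also deals with search paths and wide-string paths on Windows:

    #ifdef _WIN32
    #    include <windows.h>
    using dl_handle = HMODULE;
    inline dl_handle dl_load(const char * path)             { return LoadLibraryA(path); }
    inline void *    dl_sym (dl_handle h, const char * sym) { return (void *) GetProcAddress(h, sym); }
    #else
    #    include <dlfcn.h>
    using dl_handle = void *;
    inline dl_handle dl_load(const char * path)             { return dlopen(path, RTLD_NOW | RTLD_LOCAL); }
    inline void *    dl_sym (dl_handle h, const char * sym) { return dlsym(h, sym); }
    #endif

    // Typed convenience wrapper, matching the reinterpret_cast shown above.
    template <typename Fn>
    Fn dl_sym_typed(dl_handle handle, const char * function_name) {
        return reinterpret_cast<Fn>(dl_sym(handle, function_name));
    }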

static dl_handle * dl_load_library(const std::wstring & path) {

@chraac (Author) commented Feb 25, 2025

> I didn't provide any support to @chraac and his team. as I said before: I don't know this guy and his team and I'd like to see their success in this community. thanks so much.

I'd like to rephrase my previous statement. I appreciate your earlier work, as my fork is based on your initial PR.

}

if (_rpc_buffer) {
memcpy(_rpc_buffer->get_buffer(), _buffer->get_buffer(), _buffer->get_size());


Great effort! According to the QNN Shared Memory doc, the _rpc_buffer in HTP can be directly accessed by the CPU. Maybe there can be a zero-copy implementation.

@chraac (Author) commented Feb 25, 2025

Yeah, thank you for the reminder! Currently the RPC buffer is disabled:

    bool should_use_mem_handle() const {
        // TODO: figure out how to set rpc mem to multiple tensor
        return false;
    }

We thought we could reuse the RPC buffer for backing ggml tensors in the future, but for now it is disabled by default. A sketch of the difference between the two paths is below.
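
For reference, the copy path versus the possible zero-copy path, with hypothetical stand-ins for the qnn_buffer_interface implementations:

    #include <cstddef>
    #include <cstring>

    // Hypothetical stand-in for a qnn_buffer_interface implementation.
    struct qnn_buffer_sketch {
        void * data = nullptr;
        size_t size = 0;
        void * get_buffer() const { return data; }
        size_t get_size()   const { return size; }
    };

    // Current path (matches the snippet above): tensor data lives in a plain host
    // buffer, so binding copies it into the RPC buffer before graph execution.
    static void bind_with_copy(const qnn_buffer_sketch & host_buf, qnn_buffer_sketch & rpc_buf) {
        std::memcpy(rpc_buf.get_buffer(), host_buf.get_buffer(), host_buf.get_size());
    }

    // Possible zero-copy path: allocate the tensor's backing store from rpcmem up
    // front and register it as a QNN memory handle, so the CPU and the HTP share one
    // physical buffer and the memcpy above disappears. This is what
    // should_use_mem_handle() would gate once multi-tensor handling is sorted out.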

@chraac (Author) commented Feb 25, 2025

I have an item for this in my project backlog: qnn backend (view).

@chraac chraac requested a review from oreomaker February 25, 2025 10:27
return true;
}

bool ggml_qnn_matmul_op_config::create_mat_mul_nodes(QNNBackend device, Qnn_GraphHandle_t graph_handle, const int rank,
@chraac (Author) commented Feb 25, 2025

Here's how we create the corresponding mat_mul op (screenshot of the resulting op structure omitted). The structure follows ggml's matrix-multiplication guideline:
https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
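
To spell out the convention this mapping has to respect: per the guideline linked above, C = ggml_mul_mat(ctx, A, B) means C^T = A * B^T, i.e. C = B * A^T, so one operand effectively has to be transposed when lowering to a conventional MatMul. A shape-only sketch (no QNN calls; the struct is illustrative):

    #include <cassert>
    #include <cstdint>

    struct shape2d_sketch { int64_t ne0, ne1; };   // ggml-style: ne0 = columns, ne1 = rows

    // A is [K, M] (ne0 = K, ne1 = M), B is [K, N]; the shared dimension is K and the
    // result C is [M, N], i.e. C = B * A^T in row-major terms. The QNN MatMul node
    // therefore needs one input transposed (or an explicit transpose node), which is
    // what create_mat_mul_nodes sets up.
    static shape2d_sketch mul_mat_result_shape(shape2d_sketch a, shape2d_sketch b) {
        assert(a.ne0 == b.ne0);                    // shared K dimension
        return { a.ne1, b.ne1 };                   // C: ne0 = M, ne1 = N
    }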

output += ')';
}

void get_graph_key_from_cgraph(const ggml_cgraph *cgraph, std::string &output) {
@chraac (Author) commented:

Generates a unique key for a given ggml_cgraph. The key is constructed by concatenating the descriptions of the operations and their associated tensor dimensions within the graph.

Example key format: MUL_MATf32_256x16x10f32_256x1x10f32#LOG#ADD#ADDf32_16x1x10f32

This may need some refactoring to handle more complex graph structures and edge cases. A simplified sketch of the key-building loop is below.
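
A simplified sketch of how such a key can be assembled; it assumes access to the internal ggml_cgraph fields, and the exact formatting in the real implementation differs slightly (for example in where output types are recorded and which ops include dimensions):

    #include <string>
    #include "ggml.h"   // ggml_tensor, ggml_op_name, ggml_type_name, ggml_n_dims
                        // (plus the internal header that defines ggml_cgraph)

    // Append "<ne0>x<ne1>x...<type>" for one tensor, e.g. "256x16x10f32".
    static void append_tensor_desc_sketch(const ggml_tensor * t, std::string & out) {
        const int n_dims = ggml_n_dims(t);
        for (int i = 0; i < n_dims; ++i) {
            if (i > 0) {
                out += 'x';
            }
            out += std::to_string(t->ne[i]);
        }
        out += ggml_type_name(t->type);
    }

    // Concatenate "<OP>_<src0 desc>_<src1 desc>" per node, separated by '#'.
    static void get_graph_key_sketch(const ggml_cgraph * cgraph, std::string & out) {
        for (int i = 0; i < cgraph->n_nodes; ++i) {
            const ggml_tensor * node = cgraph->nodes[i];
            if (i > 0) {
                out += '#';
            }
            out += ggml_op_name(node->op);
            for (int s = 0; s < GGML_MAX_SRC && node->src[s]; ++s) {
                out += '_';
                append_tensor_desc_sketch(node->src[s], out);
            }
        }
    }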

* fix warning

* wip

* add todo for graph key generate

* rename some file to meet upstream guideline

* remove local .clang-format

* expend supported/unsupported counter to all ops

* append device name to log

* port to ggml logger

* fix warning after adapt to ggml logger

* append \n to all log

* use case op instead of convert

* Revert "use case op instead of convert"

This reverts commit e662fc2.

* fix op that needs same shape

* opt kQnnOpsTable

* refresh params name field when getting op config

* opt npu log print

* remove unused functions
* debug

* disable reshape

* make sure single node op have same type

* fix warning at the logger

* Revert "disable reshape"

This reverts commit 5aeca4b.
* print build type

* wip

* print compiling flags

* wip

* wip
@chraac (Author) commented Mar 1, 2025

> I never drop such a comment in other's PR, this is my first time in this great tech community which is out of mainland China, sorry to waste resource and time in public community, thanks.

I notice you've edited your original post with additional information. I'd like to clarify that my intent was to address specific technical issues that have existed throughout your PR series: without implementing correct matrix transposition, the mul_mat operation cannot function properly.

And to reiterate: please focus on improving your codebase in an objective manner without making assumptions about or judging others' work.

If you have any thoughts on my source code implementation, they would be very welcome! I'm open to discussion about the design, implementation details, or any other technical aspects of the code.

Collaborative feedback helps us all build better software. By sharing insights about implementation approaches, performance considerations, and edge cases, we collectively create more reliable and efficient code than any individual contributor could achieve independently. (Not gonna lie - it can be tough sometimes, but I'm all about keeping an open mind and hearing different viewpoints. Just trying my best here!)

@zhouwg

This comment was marked as off-topic.

QNN_LOG_DEBUG("[%s][%s]op was unsupported, support/unsupported: %d/%d\n", qnn::get_backend_name(ctx->device),
ggml_op_name(op->op), ctx->supported_op_count.load(), ctx->unsupported_op_count.load());
}
#endif
@chraac (Author) commented Mar 1, 2025

In our recent PR, we added a counter to track which operations are successfully offloaded to the QNN backend. While testing with the llama-3-8B-Instruct-Q4_K_M model, I found an interesting result:

(screenshot of the per-op supported/unsupported counter output omitted)

Current Status

  • Even though quantized tensor support isn't implemented yet, many operations are still being processed by the QNN backend since they operate on F32 data
  • As shown in the screenshot, we're seeing significant operation offloading opportunities
  • However, no MUL_MAT ops are currently being offloaded to QNN, and these are critical for performance

Next Steps

Based on this analysis, I'm shifting focus a bit to implement support for additional operation types that can be offloaded from the CPU to QNN; this will provide immediate performance benefits while running models on device.
Simultaneously, I will continue investigating how to port GGML's quantization scheme to QNN; this remains a core objective for our long-term performance goals, especially for quantized models like the one used in this test.

Test method and Resources

  1. Push the LLM model to the Android device folder /data/local/tmp
  2. Run scripts/run_device_model.sh --verbose --model-name 'meta-llama_Meta-Llama-3-8B-Instruct-Q4_K_M.gguf'; run_device_model.sh can be found here

Full running log:
run_model.8b.q4.debug.log
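
For readers curious how the counter works, a simplified sketch is below; the atomic counter fields match the snippet further up, but the surrounding context type and the supports-op hook are illustrative stand-ins rather than the PR's exact code:

    #include <atomic>
    #include <cstdint>

    struct qnn_device_context_sketch {
        std::atomic<uint32_t> supported_op_count{0};
        std::atomic<uint32_t> unsupported_op_count{0};
    };

    // Called from the backend's supports-op hook: count each decision so the debug
    // log can report how many ops would be offloaded vs. left to the CPU backend.
    static bool device_supports_op_sketch(qnn_device_context_sketch * ctx, bool op_is_supported) {
        if (op_is_supported) {
            ctx->supported_op_count.fetch_add(1, std::memory_order_relaxed);
        } else {
            ctx->unsupported_op_count.fetch_add(1, std::memory_order_relaxed);
        }
        return op_is_supported;
    }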

A contributor commented:

@chraac Does this backend work simultaneously with the Adreno OpenCL backend?

Is the idea to offload as much as possible to the NPU, then OpenCL, and then the CPU?

@chraac (Author) replied:

> Does this backend work simultaneously with the Adreno OpenCL backend?

Hi @conradev,
Thank you for sharing that link. While both solutions aim to improve performance on Qualcomm SoCs, they take different approaches:

  • The Adreno OpenCL backend is specifically optimized for Adreno GPUs, building on the original OpenCL backend with Adreno-specific optimizations.

  • My implementation leverages the QNN SDK, which is Qualcomm's official ML inference framework. It works as a higher-level abstraction layer that maps GGML operations to QNN's native operations. This approach can target multiple compute devices (CPU, GPU, and NPU) on Qualcomm platforms, providing greater flexibility in deployment scenarios.

These implementations represent two distinct but complementary approaches to hardware acceleration on Qualcomm devices: one focused specifically on Adreno GPU optimization via OpenCL, and the other providing a vendor-supported framework integration with broader device support.

> Is the idea to offload as much as possible to the NPU, then OpenCL, and then the CPU?

Short answer: It's up to the GGML framework's scheduler to make that decision.

In our implementation, we simply provide capability information to GGML about whether each QNN device (CPU/GPU/NPU) can support specific operations. We also indicate the device type (CPU/GPU/ACCEL) for each QNN backend. The GGML framework then uses this information to determine which device should execute each operation.

For llama-3-8B-Instruct-Q4_K_M, we've observed that the scheduler tends to prefer qnn-gpu over qnn-npu for many operations. This preference is likely based on the device type classifications we provided to the scheduler.


I'm considering setting all QNN devices (CPU/GPU/NPU) as GGML_BACKEND_DEVICE_TYPE_ACCEL. This approach would give them the same scheduling priority during offloading and should result in more operations being scheduled to the NPU. A sketch of what that would look like is below.
@conradev, I'd appreciate your thoughts on this approach. Thanks!
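
For illustration, a minimal sketch of that idea, assuming the usual ggml device interface from ggml-backend.h; the QNN-side enum here is a hypothetical stand-in:

    #include "ggml-backend.h"

    enum class qnn_device_sketch { cpu, gpu, npu };   // illustrative, not the PR's enum

    // Device-type hook: report every QNN device as an accelerator so the ggml
    // scheduler gives qnn-cpu / qnn-gpu / qnn-npu the same offload priority.
    static enum ggml_backend_dev_type qnn_device_get_type_sketch(qnn_device_sketch /*dev*/) {
        // Mapping npu -> ACCEL, gpu -> GPU, cpu -> CPU would instead let the
        // scheduler rank the devices differently.
        return GGML_BACKEND_DEVICE_TYPE_ACCEL;
    }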

@chraac (Author) commented Mar 2, 2025

> I already blocked in this community before 02/16/2025 because of my stupid mistake last year which part of reasons came from this CN programmer in my first PR and which the main reason is my personal mistake, this CN programmer has already intended to use the maintainers 's hands to block me again in my third PR so his voice and misinformation can be seen by everyone in this tech community.

Let's see what @slaren said in your PR:

> Hi @zhouwg. I want to clarify that the comments made by @chraac in your previous PR had no influence whatsoever in the decision to block you from participating in this repository. Technical feedback and code reviews are always welcome and even encouraged. However, you were blocked due to a consistent pattern of comments that incited personal conflict, often in response to legitimate technical feedback. The comments linked by @chraac (now removed) are an example of this behavior.

I'm focused on improving the QNN backend support and welcome technical discussions on this topic. As the maintainer noted, provoking personal conflict isn't encouraged. Comments that stray from technical feedback will not receive a response from now on.

@chraac chraac requested a review from conradev March 17, 2025 13:55
chraac added 2 commits March 22, 2025 12:34
* move op key generate function to kOpCaps

* fix op desc print

* try fix rms_norm

* Revert "try fix rms_norm"

This reverts commit 33b2960.

* add quantization type support by converting them to float

* enable quantization tensor for mulmat in gpu/npu

* fix asan error

* add log and assert

* insert output convert operator after mulmat

* add log

* fix some error in running

* disable permute again

* add log

* add error function

* Revert "add error function"

This reverts commit f92ff47.

* add log

* more log

* disable convert op in graph

* wip

* add f16 config for graph

* set f16 precision for f16 graph

* fix override data type

* add comment

* add config flag to enable quantize type

* add log

* more quantized type for cpu and gpu backend

* enable all quant types for cpu and gpu backend

* rename

* wip

* add log

* remove unused functions

* skip permute

* remove get_qnn_op_input_param_count

* fallback to generic_get_op_desc if no op_desc

* revert 'skip permute'

* Revert "revert 'skip permute'"

This reverts commit 5761e31.

* wip

* add log

* print qnn tensor type

* add log

* limit the max size of tensor

* add log

* fix tensor size limiter

* small improve on tensor info printer

* disable sqrt and div to pass test-backend-ops for 8 gen 2

* remove debug log in release build

* add log

* skip permute in src

* wip

* disable reshape

* skip mul at decoder start

* wip

* add log

* add qnn_scoped_timer

* add perf tracker in graph

* add cmake options GGML_QNN_ENABLE_PERFORMANCE_TRACKING

* fix flag name

* use milli-second

* wip

* fix comment string

* add file for profiler

* change qnn-cpu to GGML_BACKEND_DEVICE_TYPE_ACCEL, so that we can run tests on cpu

* wip

* profiler: refactoring

* wip

* add implement for print_profile_events

* set-up profiler for graph

* set profiler to graph execute

* pretty print events

* unified log print prefix

* print event count

* enable optrace

* print duration at event end

* wip

* add more detailed soc information

* wip

* move device caps array into qnn-lib.cpp

* remove lib_name in device_context

* move get_graph_key_from_cgraph to graph.cpp

* add override type for tensor key

* use override_type instead of original data type for graph key

* append op type to tensor name to fix error in qwen

* remove todo

* wip
auto old_mode = SetErrorMode(SEM_FAILCRITICALERRORS);
SetErrorMode(old_mode | SEM_FAILCRITICALERRORS);

auto handle = LoadLibraryA(lib_path.c_str()); // TODO: use wstring version for unicode paths
@chraac (Author) commented Mar 22, 2025

Hi @slaren, I noticed we have similar dynamic library loading functionality in ggml-backend-reg.cpp (the dl_load_library function) that could be useful in other parts of the codebase.
I suggest moving this to a common utility module so we can reuse it across the project. This would help reduce code duplication and provide a consistent approach to loading libraries.
I'd be happy to prepare another PR about that, WDYT?

A repository member replied:

Sorry, I missed this. I think that this code is small enough that it is not really a problem if it is duplicated in a backend, and making it part of the public API available to backends may make it harder to change it in the future. So at the moment my preference would be to avoid this.

Labels: build (Compilation issues), ggml (changes relating to the ggml tensor library for machine learning)