[WIP] backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs #12063
base: master
Conversation
…ously and thread safe
…ing to review comments
Yeah, just to clarify, @zhouwg is not affiliated with us, but we appreciate his support! Anyone interested in discussing QNN-related topics is very welcome to join the conversation.
ggml/src/ggml-qnn/graph.cpp
Outdated
}

bool qnn_graph::build_graph_from_ggml_graph(const ggml_cgraph *cgraph) {
    QNN_LOG_DEBUG("[%s][%s]build start", get_backend_name(_device), _graph_name.c_str());
Here's how we map a ggml_cgraph into a QNN graph.
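To make that concrete, here is a rough, hypothetical sketch of the shape such a mapping loop can take (the helpers create_op_from_ggml_node and finalize_qnn_graph are invented for illustration; the PR's actual implementation in graph.cpp differs):

```cpp
// Hypothetical sketch of mapping a ggml_cgraph to a QNN graph: skip view-like
// nodes that produce no data, create one QNN op config per remaining node,
// then finalize the graph once. Helper names below are invented.
#include "ggml.h"
#include <memory>
#include <vector>

struct qnn_op_config;  // placeholder for the backend's op-config objects
std::shared_ptr<qnn_op_config> create_op_from_ggml_node(const ggml_tensor * node);  // hypothetical
bool finalize_qnn_graph(const std::vector<std::shared_ptr<qnn_op_config>> & ops);   // hypothetical

bool build_graph_from_ggml_graph_sketch(ggml_cgraph * cgraph) {
    std::vector<std::shared_ptr<qnn_op_config>> ops;
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        const ggml_tensor * node = ggml_graph_node(cgraph, i);
        switch (node->op) {
            case GGML_OP_NONE:
            case GGML_OP_VIEW:
            case GGML_OP_RESHAPE:
            case GGML_OP_PERMUTE:
            case GGML_OP_TRANSPOSE:
                continue;  // view-like ops need no QNN node of their own
            default:
                break;
        }
        auto op = create_op_from_ggml_node(node);  // returns nullptr for unsupported ops
        if (!op) {
            return false;                          // let the scheduler fall back to another backend
        }
        ops.push_back(std::move(op));
    }
    return finalize_qnn_graph(ops);                // add all nodes, then finalize the QNN graph
}
```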
ggml/src/ggml-qnn/dl_loader.hpp
Outdated
    return reinterpret_cast<Fn>(dl_sym(handle, function_name));
}

}  // namespace qnn
TODO: this dl_loader can be removed if upstream provides a unified dynamic-load mechanism
llama.cpp/ggml/src/ggml-backend-reg.cpp
Line 99 in 34a846b
static dl_handle * dl_load_library(const std::wstring & path) {
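For readers unfamiliar with these files, a minimal illustration of the kind of cross-platform wrapper such a dl_loader provides (illustrative only, not the PR's or upstream's exact code):

```cpp
// Illustrative cross-platform dynamic-loading wrapper (not the actual dl_loader.hpp):
// dlopen/dlsym on POSIX, LoadLibrary/GetProcAddress on Windows.
#ifdef _WIN32
#    include <windows.h>
using dl_handle_t = HMODULE;

static dl_handle_t dl_load(const char * path) { return LoadLibraryA(path); }
static void *      dl_sym(dl_handle_t h, const char * name) {
    return reinterpret_cast<void *>(GetProcAddress(h, name));
}
static void        dl_unload(dl_handle_t h) { FreeLibrary(h); }
#else
#    include <dlfcn.h>
using dl_handle_t = void *;

static dl_handle_t dl_load(const char * path) { return dlopen(path, RTLD_NOW | RTLD_LOCAL); }
static void *      dl_sym(dl_handle_t h, const char * name) { return dlsym(h, name); }
static void        dl_unload(dl_handle_t h) { dlclose(h); }
#endif

// typed lookup, similar in spirit to the dl_sym_typed shown above
template <typename Fn>
Fn dl_sym_typed(dl_handle_t handle, const char * function_name) {
    return reinterpret_cast<Fn>(dl_sym(handle, function_name));
}
```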
I'd like to rephrase my previous statement. I appreciate your earlier work, as my fork is based on your initial PR.
}

if (_rpc_buffer) {
    memcpy(_rpc_buffer->get_buffer(), _buffer->get_buffer(), _buffer->get_size());
Great effort! According to the QNN Shared Memory doc, the _rpc_buffer in HTP can be directly accessed by the CPU. Maybe there can be a zero-copy implementation.
Yeah, thank you for the reminder! Currently the rpc buffer is disabled:
bool should_use_mem_handle() const {
    // TODO: figure out how to set rpc mem to multiple tensor
    return false;
}
I thought we could reuse the rpc buffer for backing ggml tensors in the future, but for now it's disabled by default.
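For the record, a hypothetical sketch of the zero-copy direction discussed above, with invented interfaces (this is not the PR's code; qnn_shared_buffer and get_mem_handle() are placeholders standing in for the real qnn_buffer_interface and QNN memory registration):

```cpp
// Hypothetical sketch of the zero-copy idea: instead of memcpy-ing from a host
// buffer into the RPC buffer, back the tensor with the shared RPC buffer itself
// and hand QNN a registered memory handle. All interfaces here are invented.
#include <cstddef>
#include <cstdint>
#include <cstring>

struct qnn_shared_buffer {                       // placeholder for qnn_buffer_interface
    virtual ~qnn_shared_buffer() = default;
    virtual uint8_t * get_buffer()         = 0;  // CPU-visible pointer into shared memory
    virtual size_t    get_size() const     = 0;
    virtual void *    get_mem_handle()     = 0;  // hypothetical: registered QNN mem handle
};

// Copy path (what the quoted code above does today).
void bind_with_copy(qnn_shared_buffer * rpc, const uint8_t * host_data, size_t size) {
    std::memcpy(rpc->get_buffer(), host_data, size);
}

// Zero-copy path: the tensor's data already lives in the shared buffer, so the
// only thing to hand QNN is the memory handle; no copy happens at bind time.
void * bind_zero_copy(qnn_shared_buffer * rpc) {
    return rpc->get_mem_handle();
}
```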
I have an item for this in my project backlog here: qnn backend (view)
ggml/src/ggml-qnn/op-config-impl.cpp
Outdated
    return true;
}

bool ggml_qnn_matmul_op_config::create_mat_mul_nodes(QNNBackend device, Qnn_GraphHandle_t graph_handle, const int rank,
Here's how we create the corresponding mat_mul op. The op will look like the attached diagram, following ggml's guideline:
https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
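For reference, my reading of that guideline (worth double-checking against CONTRIBUTING.md): for C = ggml_mul_mat(ctx, A, B), A has ne = [K, M], B has ne = [K, N] and C has ne = [M, N], i.e. C = B * A^T in row-major terms, which is why the QNN MatMul mapping needs an explicit transpose on one operand. A tiny shape-only sketch:

```cpp
// Shape-only illustration (not the PR's code) of the ggml mul_mat convention the
// QNN mapping has to respect; ne0 is the fastest-varying (innermost) dimension.
#include <cassert>
#include <cstdint>

struct shape2d { int64_t ne0, ne1; };

static shape2d mul_mat_result_shape(shape2d a, shape2d b) {
    assert(a.ne0 == b.ne0);            // both operands share the K dimension
    return { a.ne1, b.ne1 };           // result ne = [M, N], i.e. C = B * A^T
}

int main() {
    shape2d a{ 256, 16 };              // A: ne = [K=256, M=16]
    shape2d b{ 256, 1 };               // B: ne = [K=256, N=1]
    shape2d c = mul_mat_result_shape(a, b);
    assert(c.ne0 == 16 && c.ne1 == 1); // C: ne = [M=16, N=1]
    return 0;
}
```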
ggml/src/ggml-qnn/backend-ops.cpp
Outdated
    output += ')';
}

void get_graph_key_from_cgraph(const ggml_cgraph *cgraph, std::string &output) {
Generates a unique key for a given ggml_cgraph. The key is constructed by concatenating the descriptions of the operations and their associated tensor dimensions within the graph.
Example key format: MUL_MATf32_256x16x10f32_256x1x10f32#LOG#ADD#ADDf32_16x1x10f32
This may need some refactoring to handle more complex graph structures and edge cases.
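As a rough illustration of the idea (simplified; the real implementation also handles op parameters, override types, and skipped nodes), the key generation can be sketched like this:

```cpp
// Simplified sketch of generating a cache key for a ggml_cgraph by concatenating
// op names, tensor types and dimensions (the PR's real implementation differs).
#include "ggml.h"
#include <string>

static void append_tensor_desc(const ggml_tensor * t, std::string & out) {
    out += ggml_type_name(t->type);                   // e.g. "f32"
    out += '_';
    for (int i = 0; i < GGML_MAX_DIMS && t->ne[i] > 1; ++i) {
        if (i > 0) out += 'x';
        out += std::to_string(t->ne[i]);
    }
}

void get_graph_key_sketch(ggml_cgraph * cgraph, std::string & output) {
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        const ggml_tensor * node = ggml_graph_node(cgraph, i);
        if (i > 0) output += '#';                     // separator between nodes
        output += ggml_op_name(node->op);             // e.g. "MUL_MAT"
        append_tensor_desc(node, output);             // output tensor type + dims
        for (int j = 0; j < GGML_MAX_SRC && node->src[j]; ++j) {
            output += '_';
            append_tensor_desc(node->src[j], output); // source tensor type + dims
        }
    }
}
```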
* fix warning * wip * add todo for graph key generate * rename some file to meet upstream guideline * remove local .clang-format * expend supported/unsupported counter to all ops * append device name to log * port to ggml logger * fix warning after adapt to ggml logger * append \n to all log * use case op instead of convert * Revert "use case op instead of convert" This reverts commit e662fc2. * fix op that needs same shape * opt kQnnOpsTable * refresh params name field when getting op config * opt npu log print * remove unused functions
* debug * disable reshape * make sure single node op have same type * fix warning at the logger * Revert "disable reshape" This reverts commit 5aeca4b.
* print build type * wip * print compiling flags * wip * wip
I notice you've edited your original post with additional information. I'd like to clarify that my intent was to address specific technical issues that have existed throughout your PR series, such as the need for correct matrix transposition.
And to reiterate: please focus on improving your codebase in an objective manner without making assumptions about or judging others' work. If you have any thoughts on my source code implementation, they would be very welcome! I'm open to discussion about the design, implementation details, or any other technical aspects of the code.
Collaborative feedback helps us all build better software. By sharing insights about implementation approaches, performance considerations, and edge cases, we collectively create more reliable and efficient code than any individual contributor could achieve independently. (Not gonna lie - it can be tough sometimes, but I'm all about keeping an open mind and hearing different viewpoints. Just trying my best here!)
        QNN_LOG_DEBUG("[%s][%s]op was unsupported, support/unsupported: %d/%d\n", qnn::get_backend_name(ctx->device),
                      ggml_op_name(op->op), ctx->supported_op_count.load(), ctx->unsupported_op_count.load());
    }
#endif
In our recent PR, we added a counter to track which operations are successfully offloaded to the QNN backend. While testing with the llama-3-8B-Instruct-Q4_K_M model, I found an interesting result:
Current Status
- Even though quantized tensor support isn't implemented yet, many operations are still being processed by the QNN backend since they operate on F32 data
- As shown in the screenshot, we're seeing significant operation offloading opportunities
- However, no MUL_MAT ops are currently being offloaded to QNN, and they are critical for performance
Next Steps
Based on this analysis, I'm shifting focus a bit to implement support for additional operation types that can be offloaded from CPU to QNN - this will provide immediate performance benefits while running models on device.
Simultaneously, I will continue investigating how to port GGML's quantization scheme to QNN - this remains a core objective for our long-term performance goals, especially for quantized models like the one used in this testing.
Test Method and Resources
- Push the LLM model to the Android device folder /data/local/tmp
- Run scripts/run_device_model.sh --verbose --model-name 'meta-llama_Meta-Llama-3-8B-Instruct-Q4_K_M.gguf' (run_device_model.sh can be found here)
Full running log: run_model.8b.q4.debug.log
@chraac Does this backend work simultaneously with the Adreno OpenCL backend?
Is the idea to offload as much as possible to the NPU, then OpenCL, and then the CPU?
Does this backend work simultaneously with the Adreno OpenCL backend?
Hi @conradev,
Thank you for sharing that link. While both solutions aim to improve performance on Qualcomm SoCs, they take different approaches:
- The Adreno OpenCL backend is specifically optimized for Adreno GPUs, building on the original OpenCL backend with Adreno-specific optimizations.
- My implementation leverages the QNN SDK, which is Qualcomm's official ML inference framework. It works as a higher-level abstraction layer that maps GGML operations to QNN's native operations. This approach can target multiple compute devices (CPU, GPU, and NPU) on Qualcomm platforms, providing greater flexibility in deployment scenarios.
These implementations represent two distinct but complementary approaches to hardware acceleration on Qualcomm devices - one focused specifically on Adreno GPU optimization via OpenCL, and the other providing a vendor-supported framework integration with broader device support.
Is the idea to offload as much as possible to the NPU, then OpenCL, and then the CPU?
Short answer: It's up to the GGML framework's scheduler to make that decision.
In our implementation, we simply provide capability information to GGML about whether each QNN device (CPU/GPU/NPU) can support specific operations. We also indicate the device type (CPU/GPU/ACCEL) for each QNN backend. The GGML framework then uses this information to determine which device should execute each operation.
For the llama-3-8B-Instruct-Q4_K_M, we've observed that the scheduler tends to prefer qnn-gpu over qnn-npu for many operations. This preference is likely based on the device type classifications we provided to the scheduler.
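To make "capability information" concrete, here's a simplified sketch of the two hooks involved (not the PR's code; the real backend wires these through ggml's backend device interface and per-op capability tables):

```cpp
// Simplified sketch (not the PR's code) of the two pieces of information a
// backend device reports to the ggml scheduler: its device type and whether it
// supports a given op node.
#include "ggml.h"
#include "ggml-backend.h"

// Device type: here the NPU device is reported as an accelerator.
enum ggml_backend_dev_type qnn_npu_device_type_sketch() {
    return GGML_BACKEND_DEVICE_TYPE_ACCEL;
}

// Op support: the scheduler asks this per node; returning false makes the node
// fall back to a lower-priority backend (eventually the CPU).
bool qnn_supports_op_sketch(const ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL:
            return op->type == GGML_TYPE_F32;  // F32-only in this sketch
        case GGML_OP_MUL_MAT:
            // e.g. quantized sources not supported yet -> let another backend take it
            return op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32;
        default:
            return false;
    }
}
```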
I'm considering setting all QNN devices (CPU/GPU/NPU) as GGML_BACKEND_DEVICE_TYPE_ACCEL. This approach would give them the same scheduling priority during offloading and should result in more operations being scheduled to the NPU.
@conradev, I'd appreciate your thoughts on this approach. Thanks!
Let's see what @slaren said in your PR:
I'm focused on improving the QNN backend support and welcome technical discussions on this topic. As the maintainer noted, provoking personal conflict isn't encouraged. Comments that stray from technical feedback will not receive a response from now on.
* move op key generate function to kOpCaps * fix op desc print * try fix rms_norm * Revert "try fix rms_norm" This reverts commit 33b2960. * add quantization type support by converting them to float * enable quantization tensor for mulmat in gpu/npu * fix asan error * add log and assert * insert output convert operator after mulmat * add log * fix some error in running * disable permute again * add log * add error function * Revert "add error function" This reverts commit f92ff47. * add log * more log * disable convert op in graph * wip * add f16 config for graph * set f16 precision for f16 graph * fix override data type * add comment * add config flag to enable quantize type * add log * more quantized type for cpu and gpu backend * enable all quant types for cpu and gpu backend * rename * wip * add log * remove unused functions * skip permute * remove get_qnn_op_input_param_count * fallback to generic_get_op_desc if no op_desc * revert 'skip permute' * Revert "revert 'skip permute'" This reverts commit 5761e31. * wip * add log * print qnn tensor type * add log * limit the max size of tensor * add log * fix tensor size limiter * small improve on tensor info printer * disable sqrt and div to pass test-backend-ops for 8 gen 2 * remove debug log in release build * add log * skip permute in src * wip * disable reshape * skip mul at decoder start * wip * add log * add qnn_scoped_timer * add perf tracker in graph * add cmake options GGML_QNN_ENABLE_PERFORMANCE_TRACKING * fix flag name * use milli-second * wip * fix comment string * add file for profiler * change qnn-cpu to GGML_BACKEND_DEVICE_TYPE_ACCEL, so that we can run tests on cpu * wip * profiler: refactoring * wip * add implement for print_profile_events * set-up profiler for graph * set profiler to graph execute * pretty print events * unified log print prefix * print event count * enable optrace * print duration at event end * wip * add more detailed soc information * wip * move device caps array into qnn-lib.cpp * remove lib_name in device_context * move get_graph_key_from_cgraph to graph.cpp * add override type for tensor key * use override_type instead of original data type for graph key * append op type to tensor name to fix error in qwen * remove todo * wip
    auto old_mode = SetErrorMode(SEM_FAILCRITICALERRORS);
    SetErrorMode(old_mode | SEM_FAILCRITICALERRORS);

    auto handle = LoadLibraryA(lib_path.c_str());  // TODO: use wstring version for unicode paths
Hi @slaren, I noticed we have similar dynamic library loading functionality in ggml-backend-reg.cpp (the dl_load_library function) that could be useful in other parts of the codebase.
I suggest moving this to a common utility module so we can reuse it across the project. This would help reduce code duplication and provide a consistent approach to loading libraries.
I'd be happy to prepare another PR for that, WDYT?
Sorry, I missed this. I think that this code is small enough that it is not really a problem if it is duplicated in a backend, and making it part of the public API available to backends may make it harder to change it in the future. So at the moment my preference would be to avoid this.
Warning: This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.
Summary
This fork is based on zhouwg's initial PR and performs further refactoring and improvements to introduce support for the Qualcomm QNN backend to GGML.
This backend is organized into three distinct integration layers:
GGML Adaptation Layer
- Graph Caching, Mapping, and Execution:
  - Graph caching (backend-ops.cpp) to minimize redundant graph creation and boost execution performance.
  - Mapping of GGML ops to QNN op configurations (op-config-caps.cpp and op-config-impl.cpp).
- Tensor Binding and Execution Flow:
  - Tensor binding and graph execution (tensor.hpp and graph.hpp), managing both host and RPC memory via buffer interfaces like qnn_buffer_interface.

QNN Object Layer
- QNN System and Instance Management:
  - Uses the qnn_system_interface class, originally derived from executorch, to create and free the QNN system context.
  - The qnn_instance class provides loading functions (load_backend() and load_system()) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
- Dynamic Resource Handling:
  - Uses load_lib_with_fallback() to reliably load both the system and RPC libraries.

Utility Layer
- Dynamic Library Loading & Search Path Management:
  - Functions in qnn-lib.cpp manage dynamic library loading with fallbacks.
  - insert_path() and set_qnn_lib_search_path() configure environment variables (like LD_LIBRARY_PATH on Linux and ADSP_LIBRARY_PATH on Android) based on a custom library search path; see the sketch after this list.
- General Utilities:
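As referenced above, a minimal sketch of the search-path handling idea (the directory and function names here are hypothetical; the actual insert_path() and set_qnn_lib_search_path() differ in details):

```cpp
// Hypothetical illustration (not the PR's code) of prepending a custom library
// directory to the search-path environment variables used on Linux/Android.
#include <cstdlib>
#include <string>

static void prepend_env_path(const char * name, const std::string & dir) {
    const char * old = std::getenv(name);
    const std::string value = (old && *old) ? dir + ":" + old : dir;
    setenv(name, value.c_str(), /*overwrite=*/1);  // POSIX setenv; Windows would need _putenv_s
}

void set_qnn_lib_search_path_sketch(const std::string & dir) {
    prepend_env_path("LD_LIBRARY_PATH",   dir);    // dynamic linker search path
    prepend_env_path("ADSP_LIBRARY_PATH", dir);    // Hexagon DSP library search path
}

// usage (hypothetical directory): set_qnn_lib_search_path_sketch("/data/local/tmp/qnn-libs");
```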
Key Features and Improvements
Graph Mapping Mechanism:
Backend Context and Device Management:
Build
For build instructions please refer to this page
Testing
Basic functionality of the QNN backend has been verified on Android, Linux, and Windows platforms using test-backend-ops; this is integrated into the pipeline for each commit node of the dev-refactoring branch.
Proper graph creation and execution paths are confirmed through detailed log messages.
Memory registration and cleanup within tensor binding functions have been thoroughly checked.
The table below shows GIFs of the QNN backend running on different platforms.
Current state
Future development