[TensorRT] Fix perf issue for DDS nodes run by TRT 10 #23424
base: main
Conversation
You can commit the suggested changes from lintrunner.
std::set<std::string> exclude_ops_set;  // excluding ops is currently not supported
#if NV_TENSORRT_MAJOR >= 10
  // TRT EP will take appropriate action later to prevent performance degradation if the graph has DDS ops that are run by TRT 10.
  is_dds_op_in_graph_ = IsDDSOpInSubGraph(graph, result, dds_op_set_);
#endif
std::vector<NodeComputeInfo>& node_compute_funcs) {
#if NV_TENSORRT_MAJOR >= 10
  // There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) when running TRT EP with TRT 10.
#endif
  for (auto& fused_node_graph : fused_nodes_and_graphs) {
 * Check if DDS op is in the ComputeCapability/subgraph.
 */
bool IsDDSOpInSubGraph(const GraphViewer& graph,
bool TensorrtExecutionProvider::IsDDSOpInSubGraph(const GraphViewer& graph,
                                                  std::vector<std::unique_ptr<ComputeCapability>>& compute_capabilities,
                                                  std::unordered_set<std::string>& dds_op_set) const {
  auto is_dds_op = [&](const auto& node) {
  };

  for (auto& compute_capability : compute_capabilities) {
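For readers following the diff fragments above, here is a minimal sketch of what a helper like IsDDSOpInSubGraph could look like. It assumes ONNX Runtime's ComputeCapability::sub_graph->nodes and GraphViewer::GetNode(NodeIndex) interfaces, and a dds_op_set holding op type names such as "NonMaxSuppression", "NonZero" and "RoiAlign"; the exact body merged in this PR may differ.

// Sketch only, not the exact PR code: walk every ComputeCapability claimed by
// the TRT EP and report whether any of its nodes is a DDS op.
bool IsDDSOpInSubGraph(const GraphViewer& graph,
                       std::vector<std::unique_ptr<ComputeCapability>>& compute_capabilities,
                       std::unordered_set<std::string>& dds_op_set) {
  // A node counts as a DDS op if its op type is listed in dds_op_set.
  auto is_dds_op = [&](const Node* node) {
    return node != nullptr && dds_op_set.count(node->OpType()) > 0;
  };

  for (auto& compute_capability : compute_capabilities) {
    for (NodeIndex node_index : compute_capability->sub_graph->nodes) {
      if (is_dds_op(graph.GetNode(node_index))) {
        return true;  // at least one DDS node will be handed to TRT
      }
    }
  }
  return false;
}

Scanning only the claimed ComputeCapability subgraphs, rather than the whole graph, keeps the check limited to nodes that TRT will actually run.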
There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) when running TRT EP with TRT 10. The issue arises because, when cudaStreamSynchronize is called after inference, GPU memory is released back to the OS. As a result, TRT has to reallocate GPU memory from the OS for the next inference run, which introduces overhead and degrades performance.
The solution is to increase the memory pool threshold, allowing TRT to retain the allocated memory across runs and avoid this overhead.
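The behavior described above matches CUDA's stream-ordered allocator: a memory pool with the default release threshold of zero returns freed blocks to the OS at the next synchronization point. As a hedged illustration of the general mechanism (not necessarily the exact call site this PR changes), raising the pool's release threshold keeps that memory cached:

#include <cuda_runtime_api.h>
#include <cstdint>

// Sketch: raise the release threshold of the device's default stream-ordered
// memory pool so blocks freed with cudaFreeAsync stay cached in the pool
// instead of being returned to the OS when the stream synchronizes.
cudaError_t KeepPoolMemoryAcrossSync(int device_id) {
  cudaMemPool_t pool = nullptr;
  cudaError_t err = cudaDeviceGetDefaultMemPool(&pool, device_id);
  if (err != cudaSuccess) return err;

  // UINT64_MAX means "never trim back to the OS"; a finite byte count would cap the cache instead.
  uint64_t threshold = UINT64_MAX;
  return cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);
}

KeepPoolMemoryAcrossSync and device_id are illustrative names only; the threshold the PR tunes may be exposed through a TRT or EP configuration option rather than a raw CUDA call.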