Messages queued with `esp_mqtt_client_enqueue()` are delayed (IDFGH-14719) #15458
Perhaps something like this? It helps to flush the already-queued messages without waiting a whole second between each one, but it doesn't solve the problem that it's a whole second before the first message is sent after calling `esp_mqtt_client_enqueue()`:

```diff
--- mqtt_client.c.orig	2025-02-25 10:41:20.280544113 +0000
+++ mqtt_client.c	2025-02-25 10:45:39.571513453 +0000
@@ -1577,6 +1577,7 @@ static void esp_mqtt_task(void *pv)
     client->state = MQTT_STATE_INIT;
     xEventGroupClearBits(client->status_bits, STOPPED_BIT);
     while (client->run) {
+        uint32_t next_timeout = MQTT_POLL_READ_TIMEOUT_MS;
         MQTT_API_LOCK(client);
         run_event_loop(client);
         // delete long pending messages
@@ -1665,6 +1666,7 @@ static void esp_mqtt_task(void *pv)
             outbox_item_handle_t item = outbox_dequeue(client->outbox, QUEUED, NULL);
             if (item) {
                 if (mqtt_resend_queued(client, item) == ESP_OK) {
+                    next_timeout = 10;
                     if (client->mqtt_state.pending_msg_type == MQTT_MSG_TYPE_PUBLISH && client->mqtt_state.pending_publish_qos == 0) {
                         // delete all qos0 publish messages once we process them
                         if (outbox_delete_item(client->outbox, item) != ESP_OK) {
@@ -1693,7 +1695,7 @@ static void esp_mqtt_task(void *pv)
 #endif
                 }
             }
-            }
+            } // else reduce next_timeout if it *will* time out before the max.
             if (process_keepalive(client) != ESP_OK) {
                 break;
@@ -1704,7 +1706,7 @@ static void esp_mqtt_task(void *pv)
                 ESP_LOGD(TAG, "Refreshing the connection...");
                 esp_mqtt_abort_connection(client);
                 client->state = MQTT_STATE_INIT;
-            }
+            } // else reduce next_timeout if it *will* time out before the max.
             break;
         case MQTT_STATE_WAIT_RECONNECT:
@@ -1733,7 +1735,7 @@ static void esp_mqtt_task(void *pv)
         }
         MQTT_API_UNLOCK(client);
         if (MQTT_STATE_CONNECTED == client->state) {
-            if (esp_transport_poll_read(client->transport, max_poll_timeout(client, MQTT_POLL_READ_TIMEOUT_MS)) < 0) {
+            if (esp_transport_poll_read(client->transport, max_poll_timeout(client, next_timeout)) < 0) {
                 ESP_LOGE(TAG, "Poll read error: %d, aborting connection", errno);
                 esp_mqtt_abort_connection(client);
             }
```
The ESP-IDF MQTT component is fairly unusable for low-latency setups such as ESPHome. By default, ESPHome calls `esp_mqtt_client_publish()` directly from the MQTT component's `loop()` function, or when publishing status updates for other components. This may block for up to 20 seconds(!!) in adverse network conditions.

With the `idf_send_async` option, subscribe and unsubscribe requests can still block the loop thread for multiple seconds, but publishing sensor updates is queued for esp-mqtt's own thread to actually send — which it does very slowly, no more than one per second, as discussed in espressif/esp-idf#15458.

And to top it all off, even with `idf_send_async` set, the so-called 'asynchronous' send can still block for ten seconds, because it takes the same MQTT_API_LOCK that the esp-mqtt thread holds while it runs its loop and a network send is timing out. This is reported in espressif/esp-idf#13078.

The only way I can see to use esp-mqtt sanely is to use a thread of our own, queueing all sub/unsub/publish requests and invoking the esp-mqtt APIs from that thread. The existing RingBuffer abstraction works nicely for this, as it already handles all the atomicity and waking when data are available. I've chosen to avoid allocations by passing the actual data through the ringbuffer, which means we impose a hard limit on the total topic+payload size for each request. An alternative would be to allocate copies in the enqueue() function and to pass *pointers* through the ringbuffer (which could be a different type of queue then, if we wanted to reinvent things).

Fixes: esphome#6810
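For reference, here is a minimal sketch of that workaround, assuming ESP-IDF's FreeRTOS ring buffer API. The names (`mqtt_req_t`, `mqtt_enqueue()`, `mqtt_worker()`) are invented for illustration, and this is not ESPHome's actual implementation (see esphome/esphome#8325 for that); the point is that the worker task is the only caller of the blocking esp-mqtt APIs:

```c
// Minimal sketch of the "own thread + ring buffer" workaround.
// mqtt_req_t, mqtt_enqueue() and mqtt_worker() are invented names.
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/ringbuf.h"
#include "mqtt_client.h"

typedef struct {
    uint16_t topic_len;
    uint16_t payload_len;
    int qos;
    int retain;
    char data[];   // topic (NUL-terminated), then payload
} mqtt_req_t;

static RingbufHandle_t s_ring;

// Called from any thread. Never touches esp-mqtt, so it cannot block on
// MQTT_API_LOCK; if the buffer is full it drops rather than blocking.
bool mqtt_enqueue(const char *topic, const char *payload, int qos, int retain)
{
    size_t tlen = strlen(topic), plen = strlen(payload);
    size_t size = sizeof(mqtt_req_t) + tlen + 1 + plen;
    void *slot;
    if (xRingbufferSendAcquire(s_ring, &slot, size, 0) != pdTRUE) {
        return false;
    }
    mqtt_req_t *req = slot;
    req->topic_len = tlen;
    req->payload_len = plen;
    req->qos = qos;
    req->retain = retain;
    memcpy(req->data, topic, tlen + 1);
    memcpy(req->data + tlen + 1, payload, plen);
    xRingbufferSendComplete(s_ring, req);
    return true;
}

// The only task that ever calls into esp-mqtt; it may block for seconds
// on a bad network, but nothing else is waiting on it.
static void mqtt_worker(void *arg)
{
    esp_mqtt_client_handle_t client = arg;
    for (;;) {
        size_t size;
        mqtt_req_t *req = xRingbufferReceive(s_ring, &size, portMAX_DELAY);
        if (!req) {
            continue;
        }
        esp_mqtt_client_publish(client, req->data,
                                req->data + req->topic_len + 1,
                                req->payload_len, req->qos, req->retain);
        vRingbufferReturnItem(s_ring, req);
    }
}

void mqtt_worker_start(esp_mqtt_client_handle_t client)
{
    s_ring = xRingbufferCreate(4096, RINGBUF_TYPE_NOSPLIT);
    xTaskCreate(mqtt_worker, "mqtt_worker", 4096, client, 5, NULL);
}
```

Because the topic and payload travel through the ring buffer by value, the buffer size is what imposes the hard per-request topic+payload limit mentioned above.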
Hi @dwmw2, thanks for reporting the issue. Does adjusting `CONFIG_MQTT_POLL_READ_TIMEOUT_MS` from Kconfig help improve the system behavior for your scenario? For the async scenarios we have `esp_mqtt_client_enqueue()`. We don't process more than one message per loop iteration to avoid holding the client mutex for too long. I'm assuming that you are publishing very often; could you describe your system so I can understand if we have some workaround for your context, or if we need to add something to the component? Could you also describe the scenario where you get the 20 s block with the network disconnected?
I'm using ESPHome, which will publish a set of messages (for each component's state) when it reconnects to MQTT. I've just deployed a device in a part of my house with poor network coverage. When the WiFi isn't behaving, `esp_mqtt_client_publish()` can block for up to 20 seconds.
Without actually testing: yes, reducing `CONFIG_MQTT_POLL_READ_TIMEOUT_MS` would presumably help, at the cost of waking the task more often even when there is nothing to send. That's why my proof-of-concept hack above reduces the poll timeout when there are messages queued, but stays at 1000 ms when the queue is empty, to try to balance between wasted time and sending rate. But even with that hack, the first message still waits for up to a second after it is enqueued, because nothing wakes the task.
@dwmw2 we are aware of the issue. One possible change is to enable flushing of the queue, if enabled by the user: add a new entry to the config struct that enables flushing the queue when needed. Similar to the solution you suggested, but controlled, to keep the current behavior. As you pointed out, we need to balance the load of the client with the retries, and accommodate scenarios with poor network connections.
Thanks, @euripedesrocha. Yes, I'm aware of the trade-off. Despatching from the queue doesn't have to have a significant delay, though. What if the thread waited not just in `esp_transport_poll_read()`, but also on the queue? I'm thinking of a design where there's a separate queue (or perhaps even a FreeRTOS Queue) between `esp_mqtt_client_enqueue()` and the client task. Then the loop in `esp_mqtt_task()` could wake and despatch a newly-enqueued message immediately, instead of sleeping out the rest of its poll timeout. That should actually be fairly simple. I'd hack up a proof of concept but I don't know FreeRTOS well enough to implement the part which waits for either the transport to be readable, or the Queue (or other form of queue/semaphore) to wake up.
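One plausible way to implement that "wait for either" part on ESP-IDF is an eventfd rather than a FreeRTOS Queue, since an eventfd can sit in the same `select()` set as a socket. The sketch below shows the mechanism only, with invented function names; it is not a patch to esp-mqtt, where the eventfd would have to be folded into the poll that `esp_transport_poll_read()` performs internally:

```c
// Sketch of the "wake the task on enqueue" mechanism using an eventfd,
// which ESP-IDF supports via the esp_vfs_eventfd component. Function
// names are invented for illustration.
#include <stdint.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/time.h>
#include "esp_err.h"
#include "esp_vfs_eventfd.h"

static int s_wake_fd = -1;

void wake_init(void)
{
    esp_vfs_eventfd_config_t config = { .max_fds = 1 };
    ESP_ERROR_CHECK(esp_vfs_eventfd_register(&config));
    s_wake_fd = eventfd(0, 0);
}

// Hypothetically called at the end of esp_mqtt_client_enqueue(), once
// the message is in the outbox, to end the task's poll early.
void wake_task(void)
{
    uint64_t one = 1;
    write(s_wake_fd, &one, sizeof(one));
}

// Wait until the broker socket is readable, a wakeup arrives, or the
// timeout expires, whichever comes first.
void wait_readable_or_woken(int transport_fd, int timeout_ms)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(transport_fd, &rfds);
    FD_SET(s_wake_fd, &rfds);
    struct timeval tv = {
        .tv_sec = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };
    int nfds = (transport_fd > s_wake_fd ? transport_fd : s_wake_fd) + 1;
    if (select(nfds, &rfds, NULL, NULL, &tv) > 0 && FD_ISSET(s_wake_fd, &rfds)) {
        uint64_t n;
        read(s_wake_fd, &n, sizeof(n));  // drain the counter
    }
}
```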
@dwmw2 The solution you propose is something you can achieve by providing your own outbox implementation and sharing the queue with your system. Having it as a general solution might have some issues, like the msg id that needs to be returned when enqueueing the message. What makes this a not-so-trivial task is that the whole library was designed around being locked for the whole processing. While that makes it easier to handle the potential concurrency problems, by avoiding them entirely, it makes it hard to refactor towards the async behaviour that users expect. Since this is a recurring issue in the library, I'll prioritize the work on this. As I mentioned, the original design that we assumed would fix the problem raised other issues in testing and questions in review, hence it was abandoned.
Hm, I thought I'd looked at that (basically the idea being just to copy `outbox.c` and provide my own implementation). But things like the msg ids are generated inside the client and need to be returned when enqueueing, which a custom outbox doesn't obviously help with. And it still doesn't solve the problem mentioned in this ticket, that only one queued message is despatched per poll-timeout interval. But yes, this is fairly close to the solution I came up with for ESPHome in esphome/esphome#8325 — just put the messages (topic and payload, by value) into a ring buffer of our own and call the esp-mqtt APIs from a dedicated thread.
This isn't exactly a problem but a design decision. If you are doing an async operation, we assume that the message can be published later. You have control over the load you intend to have on your system by controlling the network-related timeouts and task priority. If you want the message to be published right away, there is `esp_mqtt_client_publish()`. I understand that in scenarios of a poor network this imposes some blocks on the system, but we always prioritize that the process is completed and messages are sent, and that the client is always in a consistent state. I can add the option to the config to process all messages to be published in the loop, and a PR from your side is welcome as well. But this needs to be opt-in and keep the current behavior as default.
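For illustration only, such an opt-in might look like the sketch below; the `flush_all_on_loop` field is invented here and does not exist in `esp_mqtt_client_config_t` (the `outbox.limit` field does):

```c
// HYPOTHETICAL: .outbox.flush_all_on_loop is an invented field showing
// what an opt-in "drain the whole queue each loop iteration" could look
// like; only .outbox.limit exists in esp_mqtt_client_config_t today.
#include "mqtt_client.h"

esp_mqtt_client_handle_t make_client(void)
{
    esp_mqtt_client_config_t cfg = {
        .broker.address.uri = "mqtt://broker.local",  // placeholder broker
        .outbox = {
            .limit = 8 * 1024,             // real field: outbox size cap
            // .flush_all_on_loop = true,  // invented opt-in flag
        },
    };
    return esp_mqtt_client_init(&cfg);
}
```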
Ah, the perennial 'bug' vs. 'feature' debate. Which philosophically often ends up boiling down to how well-documented the behaviour in question is. To be clear, the behaviour in question is that messages enqueued with `esp_mqtt_client_enqueue()` are despatched at a rate of at most one per poll-timeout interval (one per second by default). But as far as I can see, the documentation for `esp_mqtt_client_enqueue()` doesn't mention that anywhere. If you want to call it a design decision, I think we have to fix the documentation accordingly, don't we?
Rather than hogging CPU by sending all queued messages in the loop, my hack above was just setting the timeout to something like 10 ms (when there were queued messages) so that the thread would wake sooner and send the next message in a more reasonable amount of time. Is that something that could be refined and made the default? The other thing that would still be missing is a way to wake the thread when a message is first enqueued, so that first message isn't delayed for a second. The main thing I'm trying to avoid is having to have two threads, as in my current workaround. Here's another idea which might work... extract the core of the loop in `esp_mqtt_task()` into a function the application can call directly. Then the application's own thread can suck messages off its own nonblocking queue (like that one I implemented for ESPHome), queue them up with `esp_mqtt_client_enqueue()`, and then drive the client loop itself, as sketched below.
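A sketch of how that could look from the application side; `esp_mqtt_client_loop()` and `app_queue_pop()` are hypothetical names, standing for "run one iteration of the `esp_mqtt_task()` body with the given poll timeout" and for the application's own nonblocking queue:

```c
// HYPOTHETICAL sketch: neither esp_mqtt_client_loop() nor app_queue_pop()
// exists today. The former stands for an extracted, callable version of
// the esp_mqtt_task() loop body; the latter for the application's queue.
typedef struct {
    const char *topic;
    const char *payload;
    int len, qos, retain;
} app_msg_t;

static void app_mqtt_thread(esp_mqtt_client_handle_t client)
{
    for (;;) {
        app_msg_t msg;
        // Drain our own nonblocking queue first...
        while (app_queue_pop(&msg)) {
            esp_mqtt_client_enqueue(client, msg.topic, msg.payload,
                                    msg.len, msg.qos, msg.retain, true);
        }
        // ...then run one iteration of the client state machine. A short
        // timeout stands in for "wake as soon as something is enqueued".
        esp_mqtt_client_loop(client, 10 /* poll timeout, ms */);
    }
}
```

That would keep everything on one application thread, with no second task and no contention on MQTT_API_LOCK.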
@dwmw2 thanks for the suggestions for the improvements. There is no debate regarding the async nature of publishing enqueued messages. The proposed solution of processing all unpublished messages on each loop will only hog the CPU depending on the frequency of your published messages. I believe we are discussing multiple problems at once. To fix them all will take us some time, plus the regular IDF workflow for the fixes to become available on IDF branches. I will gladly review any PR from your side, either here or on the esp-mqtt repository.
Answers checklist.
IDF version.
v5.1.5
Espressif SoC revision.
ESP32-S3
Operating System used.
Linux
How did you build your project?
Other (please specify in More Information)
If you are using Windows, please specify command line type.
None
Development Kit.
Custom Board
Power Supply used.
USB
What is the expected behavior?

Using ESPHome with the `idf_send_async` option to make it use `esp_mqtt_client_enqueue()`, I expected it to send my queued messages promptly. (I set this option because I didn't like it when `esp_mqtt_client_publish()` blocks for 20 seconds when the network is down.)

What is the actual behavior?

Each time around the loop in `esp_mqtt_task()`, it attempts to send precisely one message, then goes to sleep for another second before sending the next, leading to a large backlog and lost messages.

Steps to reproduce.

Set `idf_send_async` in an ESPHome configuration and watch how long it takes for MQTT messages to be sent.

Debug Logs.
More Information.
This patch helps a bit, by flushing the whole queue every time. But there's still up to a second (`MQTT_POLL_READ_TIMEOUT_MS`) before each message is sent. Can we wake the task when a message is queued?