[BUG] Using callbacks for error handling causes timeout #409

daveisfera · 2023-09-14T23:40:18Z

Media pipeline:
Using a modification of the sample that handles errors with callbacks (as recommended previously)

Logging
From running normally (without custom callbacks):

./kvsVideoOnlyRealtimeStreamingSample dljtest h265 60

logs_0.txt

From running with stream callback:

./kvsVideoOnlyRealtimeStreamingSample dljtest h265 60 "" 1

logs_1.txt

From running with client callback:

./kvsVideoOnlyRealtimeStreamingSample dljtest h265 60 "" 2

logs_2.txt

From running with both callbacks:

./kvsVideoOnlyRealtimeStreamingSample dljtest h265 60 "" 3

logs_3.txt

Describe the bug
Using the callbacks allows the user of the SDK to exit early rather than continuing to submit problematic data, but it doesn't exit properly and eventually is hard killed by a time out. Also, when the client callback is added, the stream callback is no longer called (you can see the lack of the expected error output in the logs run with 2 and 3 above when the client callback is enabled), so this leads me to believe that there's an error in the callback handling somewhere.

SDK version number
1.5.1

Open source building
N/A

To Reproduce
Steps to reproduce the behavior:

Clone the sample with callback added
Build
Run with callbacks enabled: ./kvsVideoOnlyRealtimeStreamingSample dljtest h265 60 "" 3
Observe that sending data stops but program doesn't exit until after the SDK is hard killed by a time out

Expected behavior
Callbacks can be used to handle errors and the SDK still exits normally when using them

Desktop (please complete the following information):

OS: Debian Bookworm
Browser: N/A
Version: N/A

Additional context
I've been told after a lot of issues that everything is totally fine and then that my not using callbacks was the source of the issue, but this appears to be one more example of there being something wrong in the SDK...

The text was updated successfully, but these errors were encountered:

sharadrj · 2023-09-25T21:36:39Z

@daveisfera , sorry we (KVS engineers) haven't been able to look at this ticket earlier, but it is now prioritized for this week. We will get back with some follow-ups.

disa6302 · 2023-09-26T20:40:08Z

@daveisfera ,

Quick clarification on the callback structure within the SDK. Client callbacks are a higher level object for all the streams that the application is setting up, while stream callbacks are specific to a particular stream. If you check out default implementation for client error callback, we invoke the stream error report callbacks for every stream within a for loop.   I am also seeing this in your branch:
return continuousRetryStreamErrorReportHandler(customData, streamHandle, uploadHandle, erroredTimecode, statusCode);

 This technically would not even build unless you are somehow moving this define into the public header file. You will probably not even need to set up a client callback provider unless there is specific handling you would like to do in the callback common to all the streams. Your set up could just set up the stream callbacks which get invoked here:  
 https://github.com/awslabs/amazon-kinesis-video-streams-producer-c/blob/master/src/source/CallbacksProvider.c#L1261-L1285

Now, coming to why it takes a long time to exit the application: stopStream timeout is set to 2 minutes by default and the stop actually completes only after receiving the last persisted ack. If this activity does not happen, we timeout on stopStreamSync() and proceed anyways. You can adjust the stopStreamTimeout using the clientInfo.stopStreamTimeout to a smaller value if you would like.

daveisfera · 2023-09-26T20:46:37Z

Quick clarification on the callback structure within the SDK. Client callbacks are a higher level object for all the streams that the application is setting up, while stream callbacks are specific to a particular stream. If you check out default implementation for client error callback, we invoke the stream error report callbacks for every stream within a for loop.   I am also seeing this in your branch: return continuousRetryStreamErrorReportHandler(customData, streamHandle, uploadHandle, erroredTimecode, statusCode);

Is there documentation of how this is should be used? I was just doing my best to guess how it would work and apparently was doing it wrong

 This technically would not even build unless you are somehow moving this define into the public header file. You will probably not even need to set up a client callback provider unless there is specific handling you would like to do in the callback common to all the streams. Your set up could just set up the stream callbacks which get invoked here:    https://github.com/awslabs/amazon-kinesis-video-streams-producer-c/blob/master/src/source/CallbacksProvider.c#L1261-L1285

Why are you saying it wouldn't compile? You can checkout the sample and see that it builds and runs just fine

Now, coming to why it takes a long time to exit the application: stopStream timeout is set to 2 minutes by default and the stop actually completes only after receiving the last persisted ack. If this activity does not happen, we timeout on stopStreamSync() and proceed anyways. You can adjust the stopStreamTimeout using the clientInfo.stopStreamTimeout to a smaller value if you would like.

Adjusting the timeout doesn't feel like an actual solution. I don't see anything in the adjustment to the sample I submitted that should make it not work and exit normally, so why are you suggesting that as a solution rather than fixing the logic problem/bug in the SDK that's preventing it from working like it should?

disa6302 · 2023-09-26T21:14:32Z

Is there documentation of how this is should be used? I was just doing my best to guess how it would work and apparently was doing it wrong
There isnt specific documentation or example per se, but, https://github.com/awslabs/amazon-kinesis-video-streams-producer-sdk-cpp/blob/master/docs/structures.md gives an overview of callbacks in the SDK work generically.

Why are you saying it wouldn't compile? You can checkout the sample and see that it builds and runs just fine
continuousRetryStreamErrorReportHandler is an internal function that is not publicly available. I did check out your branch to confirm and it does not build for me.

Adjusting the timeout doesn't feel like an actual solution. I don't see anything in the adjustment to the sample I submitted that should make it not work and exit normally, so why are you suggesting that as a solution rather than fixing the logic problem/bug in the SDK that's preventing it from working like it should?

The reason we wait on the last persisted ack is because there isnt a way to inform the curl threads to stop unless an EOS is indicated. We will look into a better solution for this and see what we can do, but this is a suggestion if you need your application to exit.

daveisfera · 2023-09-26T23:50:28Z

I'm not following. If I stop without this error, then it exits as it should. The error is handled by the callback so shouldn't it exist in the same way?

disa6302 · 2023-09-27T20:45:31Z

Ok. Let me clarify.   

It wouldnt exit in the same way. In a happy situation, the SDK is receiving persisted acks for every fragment and the fragment is getting removed from the internal buffer when the persisted ack is received. Which means the SDK also receives the final Ack it is supposed to receive on time, so the stop() does not wait on it for long (its almost immediate). However, in case of an error situation (in this case MAX_FRAGMENT_DURATION_REACHED), the SDK gets into a retry state. The SDK ends up creating new upload handles trying to see if the fragment persistence succeeds.

When stop() is invoked while streaming, there are few conditions that the stop() succeeds without a timeout:

There are no frames in the buffer (which is happening in your case because you stop putting frames)
There are no ongoing upload sessions (which is not what is happening in your case)
An EOS is encountered AND there are no more frames in the buffer.

In your case, there are 2 mitigation options:

Set stopStream timeout to a really small value so it exits quickly
You can choose to disable the retry storm by setting pStreamInfo->streamCaps.recoverOnError = FALSE;. However, note, this would disable retry for any error encountered in the upload path. This is conjunction with stopping putFrames when an error is encountered will help in an early exit.

daveisfera · 2023-09-27T22:39:58Z

Is this going to be fixed? Retrying on network or server errors makes sense, but why is it retrying over and over on client errors that are just going to continuing failing?

disa6302 · 2023-09-28T21:17:08Z

We do not retry on the same fragment. We progress to the newer fragments to see if they succeed. The intent was to see if it recovers, it could be a flaky frame rate issue or could be a transient issue from the media source that is causing it as well. When you have an error callback set, it provides you the avenue to control the reaction to the different issues that could happen in the putMedia path.

daveisfera · 2023-09-28T21:24:46Z

There are no more frames being added after the error, so it should exit normally. Can you please the modified sample I submitted with the callback to show me how I'm supposed to control the reaction to these errors?

disa6302 · 2023-09-28T21:30:57Z

I think I clarified already what is happening with the stop logic. In your case, it is timing out. You stop your put frame calls but the stop() times out.

In your case, there are 2 mitigation options:

Set stopStream timeout to a really small value so it exits quickly
You can choose to not recover pStreamInfo->streamCaps.recoverOnError = FALSE;. However, note, this would disable retry for any error encountered in the upload path. This is conjunction with stopping putFrames when an error is encountered will help in an early exit.

Solution 2 would work in your case.

What is probably missing and the feasibility of having this feature still needs to be investigated, but we are missing a stopNow() kind of call which would allow stopping an ongoing streaming session.

daveisfera · 2023-09-28T21:39:40Z

Can you help me understand how this isn't a bug? An error response was received for the frame and there are no more frames being added, so why does it have to wait for a timeout?

Also, having it not recover on errors is opening up a HUGE can of worms because of network issues that that logic is properly working to handle, so I don't believe that either of these proposed solutions are valid and this really does feel like a bug.

disa6302 · 2023-09-28T23:19:19Z

@daveisfera ,

I did some more digging and went through your sample a little more closely and also internal SDK. There is one thing I noticed in your sample and fixing it might actually help fix your issue in the sample atleast. Our SDK goes through some state transitions when this error code is encountered. What the SDK ends up doing is it creates a new upload handle and attempts to stream on that. Although the g_error setting stops putting the frame in the while loop, it is likely that the thread sleep in the while loop is making the while loop exit a cycle late, which mean stop() gets invoked after frame duration time at worst within which internal state transitions have already happened and the states we are checking to allow the stop to complete are not valid anymore. Can you try to modify your sample to either execute stopSync() call in a thread (probably waiting in a while loop with a small delay) or maybe having a signaling based mechanism to immediately kick of stopStream instead of waiting frame duration time? I think this will cause your stopSync to return immediately.

daveisfera · 2023-09-28T23:29:18Z

I believe the simplest way to do what you're asking is to move the sleep to the top of the loop, so I did that, but the behavior was exactly the same. But to be honest, I don't see how the timing of those events should impact the results. The error comes through a callback so there's no way to guarantee anything in that regard.
So I think that leaves us at the some question of "can you help me understand why you believe this isn't a bug in the SDK?"

disa6302 · 2023-09-28T23:49:15Z

I would appreciate if you try my suggestion out, ok if you want to move the sleep to the start as well but then we would be running putFrame again, as long as you handle that part as well, it should be fine, but I would like a confirmation of my theory. The goal is to ensure stop() is called as soon as the error happens without waiting an entire frame duration. Also, the logs you attached do not seem to have Handling stream in custom callback line which suggests your callbacks are not getting invoked. Maybe it is because you were using it incorrectly as per our previous discussion. Please attach new logs with my suggestions.

I would still not call it a bug because the SDK potentially never got the request in the time it checks for the stop stream request. SDK does not have a specific thread to monitor if a stop request is coming in. I think there can be some improvements that can be done to better solve a timing related issue that you are most likely hitting and ensure we stop in a potentially streaming state, but the gist of the issue as per my reading is the SDK does not get the Stop request in the time frame it checks for it. As I explained in a few messages above, there are a set of conditions the SDK checks for to allow stopping the stream and if it does not see those happen, it is going to timeout.

I would be happy to answer any more questions you might have with respect to modifying the application so that we can get a new set of logs for analyzing the situation.

If it still does not solve the issue, I can dig more, but, mostly what I want to confirm out of this exercise is that stopSync() is not returning because of timing.

daveisfera · 2023-09-29T00:09:16Z

So you want me to call the stop in the callback? How will that work and how do I ensure that it won't be called multiple times?

And what you're saying isn't true. The stop won't be called until the callback happens and g_error is set, so the timing you're suggesting is impossible.

Also, can't you try it with the sample I've provided and generate the logs yourself? Me doing my best to interpret what you've suggested and then sending logs feels like a poor substitute for you doing it yourself with the sample I've provided.

disa6302 · 2023-09-29T00:19:11Z

I am taking a look at your sample and providing suggestions and I want to confirm what you are seeing as well :)
Ofcourse I can run the sample myselves and I did a quick change to confirm whats happening on my system, but that was not the point. I needed to see the logs from your setup :) If the code flow I am suggesting is unclear, I can push a branch for you to take a look at.

The timing explanation I am suggesting is also absolutely possible and I am explaining it exactly as I see in your sample. The stop would not be called until the callback happens, yes, but the while loop would also not exit unless it checks again after frame duration. And within the frame duration, there is a lot that goes on within the SDK.

disa6302 · 2023-09-29T00:26:10Z

https://github.com/awslabs/amazon-kinesis-video-streams-producer-c/tree/callback-sample --- I have added the simple sample here. Let me know if this works for you and if it does, it is a timing related issue.

daveisfera · 2023-09-29T01:08:56Z

Here's a Dockerfile that you can use to build the SDK in a simple/reproducible way. I build it there and then run it inside of a container using that image:

# syntax = docker/dockerfile:1.5
FROM debian:bookworm-slim

RUN --mount=type=cache,target=/var/cache/apt/,sharing=locked --mount=type=cache,target=/var/lib/apt/lists/,sharing=locked \
    apt-get update && \
    apt-get install -y --no-install-recommends \
    automake \
    ca-certificates \
    cmake \
    g++ \
    gcc \
    git \
    libcurl4-openssl-dev \
    libssl-dev \
    make \
    pkg-config

ENV KVS_SDK_VERSION=1.5.1
RUN cd /opt && \
    git clone --depth 1 --branch "v$KVS_SDK_VERSION" https://github.com/awslabs/amazon-kinesis-video-streams-producer-c.git

WORKDIR /opt/amazon-kinesis-video-streams-producer-c/build/
RUN cmake .. -DBUILD_DEPENDENCIES=OFF -DCMAKE_INSTALL_PREFIX=/usr && \
    make -j $(nproc)

daveisfera · 2023-09-29T01:24:56Z

https://github.com/awslabs/amazon-kinesis-video-streams-producer-c/tree/callback-sample --- I have added the simple sample here. Let me know if this works for you and if it does, it is a timing related issue.

This version does not timeout when I run it but this error is output:

curlApiCallbacksShutdown called when already in progress of shutting down

disa6302 · 2023-09-29T15:59:59Z

Glad it worked! And thanks for sharing your Docker script. Appreciate it.

curlApiCallbacksShutdown called when already in progress of shutting down is not an error, its a warning and it is ok. It basically means that this call was made again while curl shutdown was already being handled. We do not shut down twice because of this check and that is good.

daveisfera · 2023-09-29T16:12:01Z

Ok, but what does that mean for an actual solution? I'm running this with video coming in and the functions only run when a frame comes in, so there's no way to short circuit those calls like you did here

disa6302 · 2023-09-29T16:17:44Z

The application I modified was the application you provided. Not sure what you mean by short circuiting, but I am merely running stop in a thread based on the global you had in your sample.

I think we are getting into the architecture of your application which I am not familiar with. If you could provide more context, I might be able to provide some suggestions and see what could work for you.

disa6302 · 2023-09-29T19:36:42Z

@daveisfera ,

One more thing I missed suggesting, and might be worth looking into in your architecture is to reset the stream with kinesisVideoStreamResetStream (which you seem to be doing in your minimal application you provided), but do it before stopping the stream instead of after. What this call does is reset the state machine and drop all the remaining frames in the buffer (in case there are any left in the buffer). This will help in getting into the ideal state for stop() to resolve quickly.

daveisfera · 2023-10-02T15:34:49Z

The problem with using the thread is that it's not guaranteed to solve the problem. The timing of the execution of that thread could be delayed because of busy resources or scheduling being backed up.

The architecture of my application is identical to the sample but rather than the sleep call, the read is a blocking network read that returns when the next frame is available. So the execution is basically identical and can you help me understand why this isn't mishandling of the error in the KVS Producer SDK that should be fixed there?

disa6302 · 2023-10-02T15:39:05Z

There are some improvements I have identified that can be made along the stop logic. One thing I noticed in this debugging exercise is we do not allow stopping in streaming state if we receive a stop call despite the buffer being empty. I am looking for a way to fix this logic.

But for your application, resetStream along with stopStream should work. You dont need an additional thread for it. In your sample branch, you are resetting after stopStream. Just switch the order and the streaming should stop.

daveisfera · 2023-10-02T16:08:20Z

That change was in your sample and isn't in the example from the documentation ( https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/producersdk-c-write.html ), so is there something happening incorrectly in the official sample/documentation? If so, can that be corrected?

disa6302 · 2023-10-02T16:21:23Z

resetStream is not something we really need to use every time. resetStream drops whatever frames are in the buffer and reset the internal state machine. Essentially you are in a clean state when you do a reset. Post reset, you could either stop or you could sending out new frames. Our samples are a happy case where all the frames are sent out and persisted acks are received on time.

We can look into adding some documentation for error cases, but the AWS docs are also a happy path sample.

daveisfera · 2023-10-02T16:28:47Z

You still haven't answered the original question of why this isn't a bug that needs to be fixed. A response is received from the server indicating an error happened, so every frame has been responded to and why does it keep trying and has to timeout rather than exiting gracefully?

disa6302 · 2023-10-02T16:38:18Z

#409 (comment) -- here is the answer. I have already addressed what is missing in the SDK handling and looking for a way to fix that. I have offered some solutions you can try out. This thread is long and I can understand it could be confusing. Please go through the discussion and let me know if you have any other questions.

daveisfera · 2023-10-03T15:01:50Z

Is there a way to use the SDK in a synchronous/blocking manner? It feels like there's a lot of complexity from the callbacks and such, so can it be used without that?

MushMal · 2023-10-03T16:10:54Z

This is a long thread so it would be great to summarize. Stopping the stream in a clean way is a careful orchestration. There is a need to do a last attempt to stream out the remainder of the buffer and receive the ACK (for durably persisted streaming), handle it in various conditions such as rotation or transient error and in general, the underlying network stack is adding a great complexity for various platforms. The complexity also comes from the SDK trying to always recover and not to get stuck. In order to keep the core of the SDK light, it doesn't have its own watch dog threads and relies on upper layer.

My suggestion would be to pinpoint the scenario in a unit test first which would provide a deterministic failure that can be addressed properly.

disa6302 · 2023-10-03T18:21:37Z

Summary of the discussion:

In the sample provided, it seems the internal buffers have all fragments responded to with some kind of ack and when the attempt is made to stop, it does not stop. As per [BUG] Using callbacks for error handling causes timeout #409 (comment), this is a bug that needs to be fixed and it is underway (the PR is still in works: https://github.com/awslabs/amazon-kinesis-video-streams-pic/pull/222/files)
In terms of stopping streaming while this fix is being made, a solution would be to reset the stream, followed by stopStream.

MushMal · 2023-10-03T18:37:36Z

Looking at the PR with my now limited knowledge... This area is one of the nastiest areas as it has to tame the quirks of the network stack, some timing and scenarios such as re-streaming. It would be very helpful to pinpoint the problem with a unit test which can have all the test setup, faults and the expected behavior clearly called out. Buffer gets trimmed on persisted ACKs, tail rolling in and skipping (inherently bad content). If the streaming is not configured with persistence then under normal circumstances, the view would only trim on the tail rolling.

zhangshouyin · 2023-10-06T14:08:34Z

你交叉编译的是什么平台？ARM还是MIPS？

daveisfera added the bug Something isn't working label Sep 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Using callbacks for error handling causes timeout #409

[BUG] Using callbacks for error handling causes timeout #409

daveisfera commented Sep 14, 2023

sharadrj commented Sep 25, 2023

disa6302 commented Sep 26, 2023

daveisfera commented Sep 26, 2023

disa6302 commented Sep 26, 2023

daveisfera commented Sep 26, 2023

disa6302 commented Sep 27, 2023 •

edited

Loading

daveisfera commented Sep 27, 2023

disa6302 commented Sep 28, 2023

daveisfera commented Sep 28, 2023

disa6302 commented Sep 28, 2023

daveisfera commented Sep 28, 2023

disa6302 commented Sep 28, 2023

daveisfera commented Sep 28, 2023

disa6302 commented Sep 28, 2023 •

edited

Loading

daveisfera commented Sep 29, 2023

disa6302 commented Sep 29, 2023

disa6302 commented Sep 29, 2023 •

edited

Loading

daveisfera commented Sep 29, 2023

daveisfera commented Sep 29, 2023

disa6302 commented Sep 29, 2023

daveisfera commented Sep 29, 2023

disa6302 commented Sep 29, 2023 •

edited

Loading

disa6302 commented Sep 29, 2023 •

edited

Loading

daveisfera commented Oct 2, 2023

disa6302 commented Oct 2, 2023 •

edited

Loading

daveisfera commented Oct 2, 2023

disa6302 commented Oct 2, 2023

daveisfera commented Oct 2, 2023

disa6302 commented Oct 2, 2023

daveisfera commented Oct 3, 2023

MushMal commented Oct 3, 2023

disa6302 commented Oct 3, 2023

MushMal commented Oct 3, 2023

zhangshouyin commented Oct 6, 2023

[BUG] Using callbacks for error handling causes timeout #409

[BUG] Using callbacks for error handling causes timeout #409

Comments

daveisfera commented Sep 14, 2023

sharadrj commented Sep 25, 2023

disa6302 commented Sep 26, 2023

daveisfera commented Sep 26, 2023

disa6302 commented Sep 26, 2023

daveisfera commented Sep 26, 2023

disa6302 commented Sep 27, 2023 • edited Loading

daveisfera commented Sep 27, 2023

disa6302 commented Sep 28, 2023

daveisfera commented Sep 28, 2023

disa6302 commented Sep 28, 2023

daveisfera commented Sep 28, 2023

disa6302 commented Sep 28, 2023

daveisfera commented Sep 28, 2023

disa6302 commented Sep 28, 2023 • edited Loading

daveisfera commented Sep 29, 2023

disa6302 commented Sep 29, 2023

disa6302 commented Sep 29, 2023 • edited Loading

daveisfera commented Sep 29, 2023

daveisfera commented Sep 29, 2023

disa6302 commented Sep 29, 2023

daveisfera commented Sep 29, 2023

disa6302 commented Sep 29, 2023 • edited Loading

disa6302 commented Sep 29, 2023 • edited Loading

daveisfera commented Oct 2, 2023

disa6302 commented Oct 2, 2023 • edited Loading

daveisfera commented Oct 2, 2023

disa6302 commented Oct 2, 2023

daveisfera commented Oct 2, 2023

disa6302 commented Oct 2, 2023

daveisfera commented Oct 3, 2023

MushMal commented Oct 3, 2023

disa6302 commented Oct 3, 2023

MushMal commented Oct 3, 2023

zhangshouyin commented Oct 6, 2023

disa6302 commented Sep 27, 2023 •

edited

Loading

disa6302 commented Sep 28, 2023 •

edited

Loading

disa6302 commented Sep 29, 2023 •

edited

Loading

disa6302 commented Sep 29, 2023 •

edited

Loading

disa6302 commented Sep 29, 2023 •

edited

Loading

disa6302 commented Oct 2, 2023 •

edited

Loading