-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Using callbacks for error handling causes timeout #409
Comments
@daveisfera , sorry we (KVS engineers) haven't been able to look at this ticket earlier, but it is now prioritized for this week. We will get back with some follow-ups. |
Quick clarification on the callback structure within the SDK. Client callbacks are a higher level object for all the streams that the application is setting up, while stream callbacks are specific to a particular stream. If you check out default implementation for client error callback, we invoke the stream error report callbacks for every stream within a for loop.
I am also seeing this in your branch:
This technically would not even build unless you are somehow moving this define into the public header file. You will probably not even need to set up a client callback provider unless there is specific handling you would like to do in the callback common to all the streams. Your set up could just set up the stream callbacks which get invoked here:
Now, coming to why it takes a long time to exit the application: |
Is there documentation of how this is should be used? I was just doing my best to guess how it would work and apparently was doing it wrong
Why are you saying it wouldn't compile? You can checkout the sample and see that it builds and runs just fine
Adjusting the timeout doesn't feel like an actual solution. I don't see anything in the adjustment to the sample I submitted that should make it not work and exit normally, so why are you suggesting that as a solution rather than fixing the logic problem/bug in the SDK that's preventing it from working like it should? |
The reason we wait on the last persisted ack is because there isnt a way to inform the curl threads to stop unless an EOS is indicated. We will look into a better solution for this and see what we can do, but this is a suggestion if you need your application to exit. |
I'm not following. If I stop without this error, then it exits as it should. The error is handled by the callback so shouldn't it exist in the same way? |
Ok. Let me clarify. It wouldnt exit in the same way. In a happy situation, the SDK is receiving persisted acks for every fragment and the fragment is getting removed from the internal buffer when the persisted ack is received. Which means the SDK also receives the final Ack it is supposed to receive on time, so the stop() does not wait on it for long (its almost immediate). However, in case of an error situation (in this case When stop() is invoked while streaming, there are few conditions that the stop() succeeds without a timeout:
In your case, there are 2 mitigation options:
|
Is this going to be fixed? Retrying on network or server errors makes sense, but why is it retrying over and over on client errors that are just going to continuing failing? |
We do not retry on the same fragment. We progress to the newer fragments to see if they succeed. The intent was to see if it recovers, it could be a flaky frame rate issue or could be a transient issue from the media source that is causing it as well. When you have an error callback set, it provides you the avenue to control the reaction to the different issues that could happen in the putMedia path. |
There are no more frames being added after the error, so it should exit normally. Can you please the modified sample I submitted with the callback to show me how I'm supposed to control the reaction to these errors? |
I think I clarified already what is happening with the stop logic. In your case, it is timing out. You stop your put frame calls but the stop() times out. In your case, there are 2 mitigation options:
Solution 2 would work in your case. What is probably missing and the feasibility of having this feature still needs to be investigated, but we are missing a stopNow() kind of call which would allow stopping an ongoing streaming session. |
Can you help me understand how this isn't a bug? An error response was received for the frame and there are no more frames being added, so why does it have to wait for a timeout? Also, having it not recover on errors is opening up a HUGE can of worms because of network issues that that logic is properly working to handle, so I don't believe that either of these proposed solutions are valid and this really does feel like a bug. |
I did some more digging and went through your sample a little more closely and also internal SDK. There is one thing I noticed in your sample and fixing it might actually help fix your issue in the sample atleast. Our SDK goes through some state transitions when this error code is encountered. What the SDK ends up doing is it creates a new upload handle and attempts to stream on that. Although the g_error setting stops putting the frame in the while loop, it is likely that the thread sleep in the while loop is making the while loop exit a cycle late, which mean stop() gets invoked after frame duration time at worst within which internal state transitions have already happened and the states we are checking to allow the stop to complete are not valid anymore. Can you try to modify your sample to either execute stopSync() call in a thread (probably waiting in a while loop with a small delay) or maybe having a signaling based mechanism to immediately kick of stopStream instead of waiting frame duration time? I think this will cause your stopSync to return immediately. |
I believe the simplest way to do what you're asking is to move the sleep to the top of the loop, so I did that, but the behavior was exactly the same. But to be honest, I don't see how the timing of those events should impact the results. The error comes through a callback so there's no way to guarantee anything in that regard. |
I would appreciate if you try my suggestion out, ok if you want to move the sleep to the start as well but then we would be running putFrame again, as long as you handle that part as well, it should be fine, but I would like a confirmation of my theory. The goal is to ensure stop() is called as soon as the error happens without waiting an entire frame duration. Also, the logs you attached do not seem to have I would still not call it a bug because the SDK potentially never got the request in the time it checks for the stop stream request. SDK does not have a specific thread to monitor if a stop request is coming in. I think there can be some improvements that can be done to better solve a timing related issue that you are most likely hitting and ensure we stop in a potentially streaming state, but the gist of the issue as per my reading is the SDK does not get the Stop request in the time frame it checks for it. As I explained in a few messages above, there are a set of conditions the SDK checks for to allow stopping the stream and if it does not see those happen, it is going to timeout. I would be happy to answer any more questions you might have with respect to modifying the application so that we can get a new set of logs for analyzing the situation. If it still does not solve the issue, I can dig more, but, mostly what I want to confirm out of this exercise is that stopSync() is not returning because of timing. |
So you want me to call the stop in the callback? How will that work and how do I ensure that it won't be called multiple times? And what you're saying isn't true. The stop won't be called until the callback happens and Also, can't you try it with the sample I've provided and generate the logs yourself? Me doing my best to interpret what you've suggested and then sending logs feels like a poor substitute for you doing it yourself with the sample I've provided. |
I am taking a look at your sample and providing suggestions and I want to confirm what you are seeing as well :) The timing explanation I am suggesting is also absolutely possible and I am explaining it exactly as I see in your sample. The stop would not be called until the callback happens, yes, but the while loop would also not exit unless it checks again after frame duration. And within the frame duration, there is a lot that goes on within the SDK. |
https://github.com/awslabs/amazon-kinesis-video-streams-producer-c/tree/callback-sample --- I have added the simple sample here. Let me know if this works for you and if it does, it is a timing related issue. |
Here's a
|
This version does not timeout when I run it but this error is output:
|
Glad it worked! And thanks for sharing your Docker script. Appreciate it.
|
Ok, but what does that mean for an actual solution? I'm running this with video coming in and the functions only run when a frame comes in, so there's no way to short circuit those calls like you did here |
The application I modified was the application you provided. Not sure what you mean by short circuiting, but I am merely running stop in a thread based on the global you had in your sample. I think we are getting into the architecture of your application which I am not familiar with. If you could provide more context, I might be able to provide some suggestions and see what could work for you. |
One more thing I missed suggesting, and might be worth looking into in your architecture is to reset the stream with kinesisVideoStreamResetStream (which you seem to be doing in your minimal application you provided), but do it before stopping the stream instead of after. What this call does is reset the state machine and drop all the remaining frames in the buffer (in case there are any left in the buffer). This will help in getting into the ideal state for stop() to resolve quickly. |
The problem with using the thread is that it's not guaranteed to solve the problem. The timing of the execution of that thread could be delayed because of busy resources or scheduling being backed up. The architecture of my application is identical to the sample but rather than the sleep call, the read is a blocking network read that returns when the next frame is available. So the execution is basically identical and can you help me understand why this isn't mishandling of the error in the KVS Producer SDK that should be fixed there? |
There are some improvements I have identified that can be made along the stop logic. One thing I noticed in this debugging exercise is we do not allow stopping in streaming state if we receive a stop call despite the buffer being empty. I am looking for a way to fix this logic. But for your application, |
That change was in your sample and isn't in the example from the documentation ( https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/producersdk-c-write.html ), so is there something happening incorrectly in the official sample/documentation? If so, can that be corrected? |
We can look into adding some documentation for error cases, but the AWS docs are also a happy path sample. |
You still haven't answered the original question of why this isn't a bug that needs to be fixed. A response is received from the server indicating an error happened, so every frame has been responded to and why does it keep trying and has to timeout rather than exiting gracefully? |
#409 (comment) -- here is the answer. I have already addressed what is missing in the SDK handling and looking for a way to fix that. I have offered some solutions you can try out. This thread is long and I can understand it could be confusing. Please go through the discussion and let me know if you have any other questions. |
Is there a way to use the SDK in a synchronous/blocking manner? It feels like there's a lot of complexity from the callbacks and such, so can it be used without that? |
This is a long thread so it would be great to summarize. Stopping the stream in a clean way is a careful orchestration. There is a need to do a last attempt to stream out the remainder of the buffer and receive the ACK (for durably persisted streaming), handle it in various conditions such as rotation or transient error and in general, the underlying network stack is adding a great complexity for various platforms. The complexity also comes from the SDK trying to always recover and not to get stuck. In order to keep the core of the SDK light, it doesn't have its own watch dog threads and relies on upper layer. My suggestion would be to pinpoint the scenario in a unit test first which would provide a deterministic failure that can be addressed properly. |
Summary of the discussion:
|
Looking at the PR with my now limited knowledge... This area is one of the nastiest areas as it has to tame the quirks of the network stack, some timing and scenarios such as re-streaming. It would be very helpful to pinpoint the problem with a unit test which can have all the test setup, faults and the expected behavior clearly called out. Buffer gets trimmed on persisted ACKs, tail rolling in and skipping (inherently bad content). If the streaming is not configured with persistence then under normal circumstances, the view would only trim on the tail rolling. |
你交叉编译的是什么平台?ARM还是MIPS? |
Media pipeline:
Using a modification of the sample that handles errors with callbacks (as recommended previously)
Logging
From running normally (without custom callbacks):
logs_0.txt
From running with stream callback:
logs_1.txt
From running with client callback:
logs_2.txt
From running with both callbacks:
logs_3.txt
Describe the bug
Using the callbacks allows the user of the SDK to exit early rather than continuing to submit problematic data, but it doesn't exit properly and eventually is hard killed by a time out. Also, when the client callback is added, the stream callback is no longer called (you can see the lack of the expected error output in the logs run with
2
and3
above when the client callback is enabled), so this leads me to believe that there's an error in the callback handling somewhere.SDK version number
1.5.1
Open source building
N/A
To Reproduce
Steps to reproduce the behavior:
./kvsVideoOnlyRealtimeStreamingSample dljtest h265 60 "" 3
Expected behavior
Callbacks can be used to handle errors and the SDK still exits normally when using them
Desktop (please complete the following information):
Additional context
I've been told after a lot of issues that everything is totally fine and then that my not using callbacks was the source of the issue, but this appears to be one more example of there being something wrong in the SDK...
The text was updated successfully, but these errors were encountered: