Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in uploadHandle rotation with persisted ACKs causing timeout #278

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

grantralston
Copy link

@grantralston grantralston commented Feb 13, 2025

Issue
Our application utilizes the Gstreamer plugin kvssink from the Kinesis producer-c library. Our higher level application (Java) sends MKV containers into the gstreamer pipeline as available. They currently are available once every 10 seconds. We wait for the persisted ACK signal from kvssink before allowing the next fragment to be sent, as we have a use case for knowing 100% for sure that each fragment is persisted.

In aid of this, we send an EoFr frame at the end of each MKV cluster.

Recent behavior in regard to this was discovered frequently failing pipelines due to kvssink. On investigation it was discovered the root cause of this was uploadHandles timing out on stream token rotation. (Application uses IoT Credential Provider).

[java] 31 Jan 2025 15:46:49,125 [INFO] [] (MkvFileReaderSrc-worker) com.package.ClassX: Container Loaded - stream [id1-archive] - 1: [1738338403619] - 2: [1738338405619] - target ACK: [1738338407618]
[java] [kvssink] [INFO ] [31-01-2025 15:46:49:125.719 GMT] getStreamingEndpointResultEvent(): Get streaming endpoint result event.
[java] [kvssink] [INFO ] [31-01-2025 15:46:49:125.750 GMT] getStreamingTokenResultEvent(): Get streaming token result event.
[java] [kvssink] [INFO ] [31-01-2025 15:46:49:125.953 GMT] putStreamResultEvent(): Put stream result event. New upload handle 10
[java] [kvssink] [INFO ] [31-01-2025 15:46:49:185.257 GMT] writeHeaderCallback(): RequestId: f0890943-5703-6b32-909f-fbeae1fd6f62
[java] [kvssink] [INFO ] [31-01-2025 15:46:49:829.444 GMT] postReadCallback(): Reported end-of-stream for stream id1-archive. Upload handle: 9
[java] [kvssink] [WARN ] [31-01-2025 15:47:19:154.901 GMT] curlCompleteSync(): [id1-archive] curl perform failed for url https://s-x.kinesisvideo.us-east-1.amazonaws.com/putMedia with result Timeout was reached: Operation too slow. Less than 30 bytes/sec transferred the last 30 seconds
[java] [kvssink] [WARN ] [31-01-2025 15:47:19:154.977 GMT] curlCompleteSync(): [id1-archive] HTTP Error 0 : Response: (null)
[java] Request URL: https://s-x.kinesisvideo.us-east-1.amazonaws.com/putMedia
[java] Request Headers:
[java]     Authorization: AWS4-HMAC-SHA256 Credential=X/20250131/us-east-1/kinesisvideo/aws4_request, SignedHeaders=connection;host;transfer-encoding;user-agent;x-amz-date;x-amzn-fragment-acknowledgment-required;x-amzn-fragment-timecode-type;x-amzn-producer-start-timestamp;x-amzn-stream-name, Signature=X
[java] [kvssink] [WARN ] [31-01-2025 15:47:19:156.406 GMT] putStreamCurlHandler(): [id1-archive] Stream with streamHandle 140321213454512 uploadHandle 10 has exited without triggering end-of-stream. Service call result: 599
[java] [kvssink] [INFO ] [31-01-2025 15:47:19:156.414 GMT] kinesisVideoStreamTerminated(): Stream 0x7f9f141518b0 terminated upload handle 10 with service call result 599.
[java] [kvssink] [INFO ] [31-01-2025 15:47:19:156.572 GMT] putStreamResultEvent(): Put stream result event. New upload handle 11
[java] [kvssink] [WARN ] [31-01-2025 15:47:19:166.666 GMT] notifyDataAvailable(): [id1-archive] Failed to un-pause curl with error: 43. Curl object 0x7f9f9c0ab670
[java] [kvssink] [INFO ] [31-01-2025 15:47:21:703.700 GMT] writeHeaderCallback(): RequestId: fcf7467c-fb07-98b6-9ce1-b4b54df99879

Root cause was discovered in PIC code on token rotation, if all Persisted ACKs have been received and processed for the fragments that were handled by uploadHandle(N), it would enter state UPLOAD_HANDLE_STATE_TERMINATED and never pass through UPLOAD_HANDLE_STATE_AWAITING_ACK (appropriately). Though on getStreamData:CleanUp that would cause PIC to not hit the code path that pokes the subsequent handle uploadHandle(N+1). uploadHandle(N+1) will be stuck in UPLOAD_HANDLE_STATE_READY until it times out due to inactivity, and uploadHandle(N+2) starts up and successfully picks up data.

// when pUploadHandleInfo is in UPLOAD_HANDLE_STATE_TERMINATED or UPLOAD_HANDLE_STATE_AWAITING_ACK, it means current handle
// has sent all the data. So it's safe to unblock the next one if any. If WAIT_FOR_PERSISTED_ACK is enabled, no need
// to execute data available again when UPLOAD_HANDLE_STATE_TERMINATED because we did it already when upload handle
// was in UPLOAD_HANDLE_STATE_AWAITING_ACK state.
if (NULL != pUploadHandleInfo &&
    (pUploadHandleInfo->state == UPLOAD_HANDLE_STATE_AWAITING_ACK ||
    (!WAIT_FOR_PERSISTED_ACK(pKinesisVideoStr && pUploadHandleInfo->state == UPLOAD_HANDLE_STATE_TERMINATED))) {
    <poke next handle to pick up stream>

https://github.com/grantralston/amazon-kinesis-video-streams-pic/blob/c98c2a256a7bb3dfc4db41ae26d45e28ac7eec56/src/client/src/Stream.c#L1535

What was changed?
getStreamData:CleanUp path adjusted condition to trigger next upload handle poke when uploadHandle reaches UPLOAD_HANDLE_STATE_TERMINATED and has not poked the subsequent handle yet. This condition is tracked as a BOOL on the __UploadHandleInfo struct.

Why was it changed?
Fix was implemented so that edge case on stream token transition does not causes intermittent failures to pipeline and disruptions to streaming operations.

How was it changed?
getStreamData:CleanUp path adjusted condition to trigger next upload handle poke when uploadHandle reaches UPLOAD_HANDLE_STATE_TERMINATED and has not poked the subsequent handle yet. This condition is tracked as a BOOL on the __UploadHandleInfo struct.

What testing was done for the changes?

  • Commit was compiled into test version of our full application and deployed to lab environment with over 500 streams active. Failure mode was observed to be successfully fixed and no additional issues were uncovered.
  • PIC compiled with unit tests and run with ./tst/kvspic_test
  • Unit test implemented to cover scenario in question.
[==========] 2436 tests from 71 test suites ran. (1849852 ms total)
[  PASSED  ] 2435 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TimerQueueFunctionalityTest.kickTimerQueueTest

| TimerQueueFunctionalityTest.kickTimerQueueTest was failing prior to change.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant