From f01bfa35451abe8afeadfcc49e3450a3f5e54c4a Mon Sep 17 00:00:00 2001
From: Reed <reedwm@google.com>
Date: Mon, 24 Oct 2022 16:17:33 -0700
Subject: [PATCH 1/4] Create 20221024-sync-devices.md

---
 rfcs/20221024-sync-devices.md | 91 +++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 rfcs/20221024-sync-devices.md

diff --git a/rfcs/20221024-sync-devices.md b/rfcs/20221024-sync-devices.md
new file mode 100644
index 000000000..1b058ab60
--- /dev/null
+++ b/rfcs/20221024-sync-devices.md
@@ -0,0 +1,91 @@
+# API to synchronize devices
+
+| Status        | Proposed                                             |
+:-------------- |:---------------------------------------------------- |
+| **RFC #**     | [NNN](https://github.com/tensorflow/community/pull/NNN) (update when you have community PR #)|
+| **Author(s)** | Reed Wanderman-Milne (reedwm@google.com), Jonathan Dekhtiar (jdekhtiar@nvidia.com) |
+| **Sponsor**   |                                                      |
+| **Updated**   | 2022-10-24                                           |
+
+## Objective
+
+This document proposes a simple API to synchronize TensorFlow devices: `tf.sync_devices()`. This is important in accurately measuring execution time in TensorFlow GPU benchmarks, especially in microbenchmarks.
+
+## Motivation
+
+TensorFlow runs GPU ops asynchronously. This means when a user calls an op, the op will return to the user before the GPU actually finishes computing the output. Unfortunately, this causes issues when measuring performance during a benchmark. For example, the following program tries to measure how long a matrix multiplication takes, but significantly underestimates the time taken because the matmul is still running asynchronously on the GPU even after `tf.linalg.matmul` returns.
+
+```python
+import time
+import tensorflow as tf
+
+x = tf.random.uniform((4096, 4096))  # any sufficiently large matrix shows the effect
+start = time.time()
+y = tf.linalg.matmul(x, x)
+print(f'Time taken: {time.time() - start}')  # underestimates the true matmul time
+```
+
+This can be fixed by calling `y.numpy()`, which forces the Python thread to wait until the matmul finishes, but this also adds a device-to-host transfer. The benchmark only wants to measure the matmul time, not the device transfer time.
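+
+For concreteness, here is a minimal sketch of that workaround, reusing `x` from the example above; the measured time now also includes the device-to-host copy:
+
+```python
+start = time.time()
+y = tf.linalg.matmul(x, x)
+y.numpy()  # blocks until the matmul finishes, but also copies `y` to the host
+print(f'Time taken: {time.time() - start}')
+```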
+
+In the example above, only a single matmul is called, but real-world examples will run entire models with many ops. Still, the same issue applies: even after the user calls the Python functions to run the ops in their model (or calls a single `tf.function` wrapping their model), these ops will not necessarily have all finished running after the Python functions have returned.
+
+Non-GPU ops can also be made to run asynchronously with the [`tf.config.experimental.set_synchronous_execution`](https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_synchronous_execution) API, in which case the same problem applies to them as well.
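+
+As a brief illustration of that existing API (with `a` and `b` assumed to be pre-existing tensors):
+
+```python
+tf.config.experimental.set_synchronous_execution(False)
+# From here on, even CPU ops may return before their results are computed.
+z = tf.add(a, b)
+```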
+
+## User Benefit
+
+Users will be able to accurately measure the execution time of a TensorFlow model or benchmark.
+
+## Design Proposal
+
+The function `tf.sync_devices()` will be added, which synchronizes all asynchronous devices. The function takes no arguments and has no return value. The function blocks the currently running Python thread, and when the function returns, all work that was queued at the start of the call will have finished.
+
+Only GPUs (and asynchronous pluggable devices) are asynchronous by default, but all devices run asynchronously if the user calls `tf.config.experimental.set_synchronous_execution(False)`. In both cases, `tf.sync_devices()` synchronizes all relevant devices.
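+
+As a minimal sketch (assuming the API lands as proposed, and reusing `x` from the Motivation example), the benchmark above can then be measured accurately as follows:
+
+```python
+y = tf.linalg.matmul(x, x)   # warm-up run, excluded from the measurement
+tf.sync_devices()            # drain any previously queued work
+start = time.time()
+y = tf.linalg.matmul(x, x)
+tf.sync_devices()            # wait until the matmul has actually finished
+print(f'Time taken: {time.time() - start}')
+```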
+
+## Detailed Design
+
+There are two sources of asynchronous op execution in TensorFlow:
+
+1. GPU ops enqueue work in a CUDA stream, which runs asynchronously. The [`Stream::BlockHostUntilDone`](https://github.com/tensorflow/tensorflow/blob/3e25aa44bcc6bddf8c0a908934eb1c3823299ccb/tensorflow/compiler/xla/stream_executor/stream.h#L1407) C++ method synchronizes the GPU’s CUDA stream.
+2. All ops can be made to run asynchronously by calling `tf.config.experimental.set_synchronous_execution(False)`, in which case TensorFlow maintains one or more background threads to asynchronously run ops. The internal [`async_wait`](https://github.com/tensorflow/tensorflow/blob/3e25aa44bcc6bddf8c0a908934eb1c3823299ccb/tensorflow/python/eager/context.py#L2660) function synchronizes these background threads.
+
+`tf.sync_devices` will synchronize both sources of asynchrony. To address (1), an op will be added, `SyncDevice`, which on GPUs synchronizes the GPU by calling `Stream::BlockHostUntilDone`. `tf.sync_devices` will enumerate all devices with `tf.config.list_logical_devices()` and run the `SyncDevice` op on each. To address (2), `tf.sync_devices` will also call the `async_wait` function.
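+
+A rough Python-level sketch of this behavior is shown below. The `gen_sync_ops.sync_device()` wrapper is hypothetical (the exact binding for the new op is not yet defined), and `context.async_wait` is the internal function referenced above:
+
+```python
+import tensorflow as tf
+from tensorflow.python.eager import context  # internal module, shown for illustration
+
+def sync_devices():
+  # (1) Run the new SyncDevice op on every device; on GPUs this calls
+  #     Stream::BlockHostUntilDone to drain the device's CUDA stream.
+  for device in tf.config.list_logical_devices():
+    with tf.device(device.name):
+      gen_sync_ops.sync_device()  # hypothetical generated wrapper for the new op
+  # (2) Synchronize TensorFlow's background op-execution threads.
+  context.async_wait()
+```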
+
+There already exists in the TensorFlow API a context manager [`tf.experimental.async_scope`](https://www.tensorflow.org/api_docs/python/tf/experimental/async_scope), which enables asynchrony source (2) mentioned above when entered. When exited, the context manager disables the asynchrony source (2) and additionally calls `async_wait` to synchronize TensorFlow’s background threads. However, the context manager does not synchronize source (1), the CUDA streams, as `tf.sync_devices` does, and therefore GPU ops could still be pending in a CUDA stream when `tf.experimental.async_scope` exits.
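+
+For illustration, a minimal sketch of the existing context manager and its limitation (again reusing `x` from the earlier example):
+
+```python
+with tf.experimental.async_scope():
+  y = tf.linalg.matmul(x, x)
+# The background threads have been drained at this point, but the matmul may
+# still be queued in a CUDA stream, so GPU timings taken here remain unreliable.
+```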
+
+`tf.sync_devices` can only be called in Eager mode, outside `tf.function`s. TensorFlow sessions synchronize automatically at the end of `Session.run`, so this API is only useful in TensorFlow 2.
+
+
+### Alternatives Considered
+
+The API could sync a single device, taking in a device string: `tf.sync_device('GPU:0')`. The issue with this is that with TensorFlow’s asynchronous execution, there is only a single background thread per host running ops, so there is no way to synchronize a single device when the user calls `tf.config.experimental.set_synchronous_execution(False)`. This API is also slightly more complicated, taking in a mandatory argument.
+
+Another possibility is to add a synchronization method to individual tensors, similar to JAX’s [`block_until_ready` array method](https://jax.readthedocs.io/en/latest/notebooks/quickstart.html). This has the same issue as above: There is no way to synchronize a single device, let alone a single tensor.
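+
+For comparison, a minimal sketch of the JAX approach (with `x` assumed to be an existing JAX array):
+
+```python
+import jax.numpy as jnp
+
+y = jnp.dot(x, x)
+y.block_until_ready()  # waits only for this one array, not for all devices
+```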
+
+### Performance Implications
+
+There will be no performance impact on models and benchmarks which do not call `tf.sync_devices`. Calling `tf.sync_devices` in a microbenchmark is necessary to accurately measure performance. Excessively calling `tf.sync_devices` will reduce performance, but this is by design, as synchronization has an inherent cost.
+
+### Dependencies
+
+No new dependencies are added.
+
+### Engineering Impact
+
+There will be a negligible impact on binary size, startup time, build time, and test time. The amount of code added will be very small, making maintenance easy.
+
+### Platforms and Environments
+
+Of the three officially supported platforms in TensorFlow (CPUs, GPUs, and TPUs), only GPUs are asynchronous by default, and so `tf.sync_devices` only affects GPUs by default. The function `tf.config.experimental.set_synchronous_execution` can make all devices asynchronous, in which case `tf.sync_devices` affects all three platforms. Custom devices which are asynchronous by default will need to implement the `SyncDevice` op for `tf.sync_devices` to work correctly.
+
+### Best Practices
+
+For benchmarks, the best practice will be to call `tf.sync_devices` right before calling `time.time()` (or some other time measurement function) to get the execution time of the benchmark. This will be documented in the `tf.sync_devices` docstring.
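+
+For example, a sketch of this pattern for a multi-step benchmark (`train_step`, `model`, `batch`, and `num_steps` are hypothetical placeholders):
+
+```python
+tf.sync_devices()              # ensure no earlier work skews the measurement
+start = time.time()
+for _ in range(num_steps):
+  train_step(model, batch)     # e.g. a tf.function wrapping one training step
+tf.sync_devices()              # drain all queued work before reading the clock
+print(f'Mean step time: {(time.time() - start) / num_steps}')
+```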
+
+### Tutorials and Examples
+
+The docstring of `tf.sync_devices()` will describe how to use it with examples. We can later consider adding a page describing asynchronous execution in general, similar to JAX’s [Asynchronous dispatch](https://jax.readthedocs.io/en/latest/async_dispatch.html) page.
+
+
+### Compatibility
+
+`tf.sync_devices()` will be initially added as `tf.experimental.sync_devices()`, which means the API will not be covered by backwards compatibility guarantees. We do not expect to make breaking changes to the API, however.
+
+### User Impact
+
+The only user-facing change is that `tf.sync_devices` will be added.

From 43f238a20629c6d3d4ef87f9a8f432202992a602 Mon Sep 17 00:00:00 2001
From: Reed <reedwm@google.com>
Date: Tue, 25 Oct 2022 15:55:39 -0700
Subject: [PATCH 2/4] Update sponsor and date

---
 rfcs/{20221024-sync-devices.md => 20221025-sync-devices.md} | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
 rename rfcs/{20221024-sync-devices.md => 20221025-sync-devices.md} (98%)

diff --git a/rfcs/20221024-sync-devices.md b/rfcs/20221025-sync-devices.md
similarity index 98%
rename from rfcs/20221024-sync-devices.md
rename to rfcs/20221025-sync-devices.md
index 1b058ab60..66adf2ae6 100644
--- a/rfcs/20221024-sync-devices.md
+++ b/rfcs/20221025-sync-devices.md
@@ -4,8 +4,8 @@
 :-------------- |:---------------------------------------------------- |
 | **RFC #**     | [NNN](https://github.com/tensorflow/community/pull/NNN) (update when you have community PR #)|
 | **Author(s)** | Reed Wanderman-Milne (reedwm@google.com), Jonathan Dekhtiar (jdekhtiar@nvidia.com) |
-| **Sponsor**   |                                                      |
-| **Updated**   | 2022-10-24                                           |
+| **Sponsor**   | Rohan Jain (rohanj@google.com)                       |
+| **Updated**   | 2022-10-25                                           |
 
 ## Objective
 

From 1fbc2877e154973cbc37d0405e94cb18852e67cd Mon Sep 17 00:00:00 2001
From: Reed <reedwm@google.com>
Date: Tue, 25 Oct 2022 16:15:41 -0700
Subject: [PATCH 3/4] Update RFC number

---
 rfcs/20221025-sync-devices.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/20221025-sync-devices.md b/rfcs/20221025-sync-devices.md
index 66adf2ae6..ac434db8d 100644
--- a/rfcs/20221025-sync-devices.md
+++ b/rfcs/20221025-sync-devices.md
@@ -2,7 +2,7 @@
 
 | Status        | Proposed                                             |
 :-------------- |:---------------------------------------------------- |
-| **RFC #**     | [NNN](https://github.com/tensorflow/community/pull/NNN) (update when you have community PR #)|
+| **RFC #**     | [434](https://github.com/tensorflow/community/pull/434) |
 | **Author(s)** | Reed Wanderman-Milne (reedwm@google.com), Jonathan Dekhtiar (jdekhtiar@nvidia.com) |
 | **Sponsor**   | Rohan Jain (rohanj@google.com)                       |
 | **Updated**   | 2022-10-25                                           |

From 0a41c61d95c3ba921f966c7b8a5a4aef5c123c9a Mon Sep 17 00:00:00 2001
From: Reed <reedwm@google.com>
Date: Wed, 11 Jan 2023 17:55:40 -0800
Subject: [PATCH 4/4] Rename tf.sync_devices to tf.test.sync_devices

---
 rfcs/20221025-sync-devices.md | 36 ++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/rfcs/20221025-sync-devices.md b/rfcs/20221025-sync-devices.md
index ac434db8d..2138b23e7 100644
--- a/rfcs/20221025-sync-devices.md
+++ b/rfcs/20221025-sync-devices.md
@@ -9,7 +9,7 @@
 
 ## Objective
 
-This document proposes a simple API to synchronize TensorFlow devices: `tf.sync_devices()`. This is important in accurately measuring execution time in TensorFlow GPU benchmarks, especially in microbenchmarks.
+This document proposes a simple API to synchronize TensorFlow devices: `tf.test.sync_devices()`. This is important in accurately measuring execution time in TensorFlow GPU benchmarks, especially in microbenchmarks.
 
 ## Motivation
 
@@ -33,33 +33,36 @@ Users will be able to accurately measure the execution time of a TensorFlow mode
 
 ## Design Proposal
 
-The function `tf.sync_devices()` will be added, which synchronizes all asynchronous devices. The function takes no arguments and has no return value. The function blocks the currently running Python thread, and when the function returns, all work that was queued at the start of the call will have finished.
+The function `tf.test.sync_devices()` will be added, which synchronizes all asynchronous devices. The function takes no arguments and has no return value. The function blocks the currently running Python thread, and when the function returns, all work that was queued at the start of the call will have finished.
 
-Only GPUs (and asynchronous pluggable devices) are asynchronous by default, but all devices run asynchronously if the user calls `tf.config.experimental.set_synchronous_execution(False)`. In both cases, `tf.sync_devices()` synchronizes all relevant devices.
+Only GPUs (and asynchronous pluggable devices) are asynchronous by default, but all devices run asynchronously if the user calls `tf.config.experimental.set_synchronous_execution(False)`. In both cases, `tf.test.sync_devices()` synchronizes all relevant devices.
 
 ## Detailed Design
 
 There are two sources of asynchronous op execution in TensorFlow:
 
-1. GPU ops enqueue work in a CUDA stream, which runs asynchronously. The [`Stream::BlockHostUntilDone`](https://github.com/tensorflow/tensorflow/blob/3e25aa44bcc6bddf8c0a908934eb1c3823299ccb/tensorflow/compiler/xla/stream_executor/stream.h#L1407) C++ method synchronizes the GPU’s CUDA stream.
+1. GPU ops enqueue work in a CUDA stream, which runs asynchronously. The [`Stream::BlockHostUntilDone`](https://github.com/tensorflow/tensorflow/blob/3e25aa44bcc6bddf8c0a908934eb1c3823299ccb/tensorflow/compiler/xla/stream_executor/stream.h#L1407) C++ method synchronizes the GPU's CUDA stream.
 2. All ops can be made to run asynchronously by calling `tf.config.experimental.set_synchronous_execution(False)`, in which case TensorFlow maintains one or more background threads to asynchronously run ops. The internal [`async_wait`](https://github.com/tensorflow/tensorflow/blob/3e25aa44bcc6bddf8c0a908934eb1c3823299ccb/tensorflow/python/eager/context.py#L2660) function synchronizes these background threads.
 
-`tf.sync_devices` will synchronize both sources of asynchrony. To address (1), an op will be added, `SyncDevice`, which on GPUs synchronizes the GPU by calling `Stream::BlockHostUntilDone`. `tf.sync_devices` will enumerate all devices with `tf.config.list_logical_devices()` and run the `SyncDevice` op on each. To address (2), `tf.sync_devices` will also call the `async_wait` function.
+`tf.test.sync_devices` will synchronize both sources of asynchrony. To address (1), an op will be added, `SyncDevice`, which on GPUs synchronizes the GPU by calling `Stream::BlockHostUntilDone`. `tf.test.sync_devices` will enumerate all devices with `tf.config.list_logical_devices()` and run the `SyncDevice` op on each. To address (2), `tf.test.sync_devices` will also call the `async_wait` function.
 
-There already exists in the TensorFlow API a context manager [`tf.experimental.async_scope`](https://www.tensorflow.org/api_docs/python/tf/experimental/async_scope), which enables asynchrony source (2) mentioned above when entered. When exited, the context manager disables the asynchrony source (2) and additionally calls `async_wait` to synchronize TensorFlow’s background threads. However, the context manager does not synchronize source (1), the CUDA streams, as `tf.sync_devices` does, and therefore GPU ops could still be pending in a CUDA stream when `tf.experimental.async_scope` exits.
-
-`tf.sync_devices` can only be called in Eager mode, outside `tf.function`s. TensorFlow sessions synchronize automatically at the end of `Session.run`, so this API is only useful in TensorFlow 2.
+There already exists in the TensorFlow API a context manager [`tf.experimental.async_scope`](https://www.tensorflow.org/api_docs/python/tf/experimental/async_scope), which enables asynchrony source (2) mentioned above when entered. When exited, the context manager disables the asynchrony source (2) and additionally calls `async_wait` to synchronize TensorFlow's background threads. However, the context manager does not synchronize source (1), the CUDA streams, as `tf.test.sync_devices` does, and therefore GPU ops could still be pending in a CUDA stream when `tf.experimental.async_scope` exits.
 
+`tf.test.sync_devices` can only be called in Eager mode, outside `tf.function`s. TensorFlow sessions synchronize automatically at the end of `Session.run`, so this API is only useful in TensorFlow 2.
 
 ### Alternatives Considered
 
-The API could sync a single device, taking in a device string: `tf.sync_device('GPU:0')`. The issue with this is that with TensorFlow’s asynchronous execution, there is only a single background thread per host running ops, so there is no way to synchronize a single device when the user calls `tf.config.experimental.set_synchronous_execution(False)`. This API is also slightly more complicated, taking in a mandatory argument.
+The API could sync a single device, taking in a device string: `tf.test.sync_device('GPU:0')`. The issue with this is that with TensorFlow's asynchronous execution, there is only a single background thread per host running ops, so there is no way to synchronize a single device when the user calls `tf.config.experimental.set_synchronous_execution(False)`. This API is also slightly more complicated, taking in a mandatory argument.
+
+Another possibility is to add a synchronization method to individual tensors, similar to JAX's [`block_until_ready` array method](https://jax.readthedocs.io/en/latest/notebooks/quickstart.html). This has the same issue as above: There is no way to synchronize a single device, let alone a single tensor.
 
-Another possibility is to add a synchronization method to individual tensors, similar to JAX’s [`block_until_ready` array method](https://jax.readthedocs.io/en/latest/notebooks/quickstart.html). This has the same issue as above: There is no way to synchronize a single device, let alone a single tensor.
+### API Name
+
+The API is under the `tf.test` namespace because the primary use for `tf.test.sync_devices` is measuring performance during a benchmark, and benchmarking is a form of testing. The API will initially be marked as experimental by exposing the symbol as `tf.test.experimental.sync_devices`.
 
 ### Performance Implications
 
-There will be no performance impact on models and benchmarks which do not call `tf.sync_devices`. Calling `tf.sync_devices` in a microbenchmark is necessary to accurately measure performance. Excessively calling `tf.sync_devices` will reduce performance, but this is by design, as synchronization has an inherent cost.
+There will be no performance impact on models and benchmarks which do not call `tf.test.sync_devices`. Calling `tf.test.sync_devices` in a microbenchmark is necessary to accurately measure performance. Excessively calling `tf.test.sync_devices` will reduce performance, but this is by design, as synchronization has an inherent cost.
 
 ### Dependencies
 
@@ -71,21 +74,20 @@ There will be a negligible impact on binary size, startup time, build time, and
 
 ### Platforms and Environments
 
-Of the three officially supported platforms in TensorFlow (CPUs, GPUs, and TPUs), only GPUs are asynchronous by default, and so `tf.sync_devices` only affects GPUs by default. The function `tf.config.experimental.set_synchronous_execution` can make all devices asynchronous, in which case `tf.sync_devices` affects all three platforms. Custom devices which are asynchronous by default will need to implement the `SyncDevice` op for `tf.sync_devices` to work correctly.
+Of the three officially supported platforms in TensorFlow (CPUs, GPUs, and TPUs), only GPUs are asynchronous by default, and so `tf.test.sync_devices` only affects GPUs by default. The function `tf.config.experimental.set_synchronous_execution` can make all devices asynchronous, in which case `tf.test.sync_devices` affects all three platforms. Custom devices which are asynchronous by default will need to implement the `SyncDevice` op for `tf.test.sync_devices` to work correctly.
 
 ### Best Practices
 
-For benchmarks, the best practice will be to call `tf.sync_devices` right before calling `time.time()` (or some other time measurement function) to get the execution time of the benchmark. This will be documented in the `tf.sync_devices` docstring.
+For benchmarks, the best practice will be to call `tf.test.sync_devices` right before calling `time.time()` (or some other time measurement function) to get the execution time of the benchmark. This will be documented in the `tf.test.sync_devices` docstring.
 
 ### Tutorials and Examples
 
-The docstring of `tf.sync_devices()` will describe how to use it with examples. We can later consider adding a page describing asynchronous execution in general, similar to JAX’s [Asynchronous dispatch](https://jax.readthedocs.io/en/latest/async_dispatch.html) page.
-
+The docstring of `tf.test.sync_devices()` will describe how to use it with examples. We can later consider adding a page describing asynchronous execution in general, similar to JAX's [Asynchronous dispatch](https://jax.readthedocs.io/en/latest/async_dispatch.html) page.
 
 ### Compatibility
 
-`tf.sync_devices()` will be initially added as `tf.experimental.sync_devices()`, which means the API will not be covered by backwards compatibility guarantees. We do not expect to make breaking changes to the API, however.
+`tf.test.sync_devices()` will be initially added as `tf.test.experimental.sync_devices()`, which means the API will not be covered by backwards compatibility guarantees. We do not expect to make breaking changes to the API, however.
 
 ### User Impact
 
-The only user-facing change is that `tf.sync_devices` will be added.
+The only user-facing change is that `tf.test.sync_devices` will be added.