ngclient: support StorageBackendInterface? #2676

Open
woodruffw opened this issue Jul 18, 2024 · 2 comments

@woodruffw
Contributor

Description of issue or feature request:

Right now, tuf.ngclient is heavily tied to local system I/O: it assumes a metadata directory on disk that can be read/written. For example:

    def _persist_metadata(self, rolename: str, data: bytes) -> None:
        """Write metadata to disk atomically to avoid data loss."""
        temp_file_name: Optional[str] = None
        try:
            # encode the rolename to avoid issues with e.g. path separators
            encoded_name = parse.quote(rolename, "")
            filename = os.path.join(self._dir, f"{encoded_name}.json")
            with tempfile.NamedTemporaryFile(
                dir=self._dir, delete=False
            ) as temp_file:
                temp_file_name = temp_file.name
                temp_file.write(data)
            os.replace(temp_file.name, filename)
        except OSError as e:
            # remove tempfile if we managed to create one,
            # then let the exception happen
            if temp_file_name is not None:
                with contextlib.suppress(FileNotFoundError):
                    os.remove(temp_file_name)
            raise e

This is problematic in distributed worker setups like Warehouse (PyPI), where each worker has its own container/entire VM and thus can't easily share on-disk TUF repos. In particular, this causes both reliability and security concerns:

  • Reliability: an unfortunate corruption in a single worker's TUF repo results in a hard-to-diagnose flaky worker, since each worker has its own copy of the repo.
  • Security: each worker's TUF repo is independently stored on a (machine-local) disk, making them harder to audit.

This problem was noted a few years back, before tuf.ngclient was created: #1009. The solution then was to add a filesystem abstraction to the tuf.metadata APIs, which was done via secure-systems-lab/securesystemslib#232 and #1009. However, this abstraction wasn't added to the ngclient APIs, only to the low-level metadata ones.

Current behavior:

tuf.ngclient currently assumes that it can perform persistent local I/O for its repository.

Expected behavior:

tuf.ngclient should support an I/O abstraction (such as the pre-existing StorageBackendInterface, if suitable) for persistent repo operations, enabling use in distributed deployments.
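
As a rough illustration of the kind of abstraction being requested, here is a minimal sketch. The names `MetadataStorage`, `InMemoryStorage`, and `copy_metadata` are hypothetical and exist neither in tuf.ngclient nor in securesystemslib; the real StorageBackendInterface has a different shape. The in-memory class only stands in for whatever shared backend a distributed deployment might use.

```python
from typing import Dict, Protocol


class MetadataStorage(Protocol):
    """Hypothetical interface; nothing like this exists in tuf.ngclient today."""

    def read(self, key: str) -> bytes: ...

    def write(self, key: str, data: bytes) -> None: ...


class InMemoryStorage:
    """Toy stand-in for a shared backend such as an object store or database."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        try:
            return self._blobs[key]
        except KeyError:
            # raise the same error a filesystem backend would for a missing file
            raise FileNotFoundError(key) from None

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = data


def copy_metadata(src: MetadataStorage, dst: MetadataStorage, key: str) -> None:
    """Code written against the interface does not care where the bytes live."""
    dst.write(key, src.read(key))
```

With something along these lines, every worker could be handed the same storage object (or a client for the same remote store) instead of each keeping a private metadata directory.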

@jku
Member

jku commented Sep 23, 2024

I think the expected behaviour sounds reasonable.

There is a related question to consider -- in a scenario where you have "distributed workers", maybe what you really want is a bunch of "read-only" workers that operate without ever connecting to the repository (at least for metadata), and one writing tuf client that actually does the updates at regular intervals.

Previously we tried to make an offline mode that would be user friendly -- usable by CLI apps -- and that turned out to be complicated (compared to the potential advantages). The "offline mode" described above (where it's OK to just fail immediately if the local metadata is not up-to-date and someone promises to keep it updated) would be simple to add.
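
The "fail immediately" part of such a read-only mode might look roughly like the sketch below. It assumes the standard TUF JSON layout (a `signed` object with an `expires` timestamp of the form `YYYY-MM-DDTHH:MM:SSZ`); `StaleMetadataError` and `require_fresh` are hypothetical names, not ngclient API. Importantly, this shows only the freshness check: a real client would still have to verify signatures and the rest of the metadata chain before trusting anything.

```python
import json
from datetime import datetime, timezone
from typing import Optional


class StaleMetadataError(Exception):
    """Hypothetical error: locally stored metadata has expired."""


def require_fresh(raw: bytes, now: Optional[datetime] = None) -> dict:
    """Return the 'signed' portion of a metadata file, or fail if it is expired.

    NOTE: this checks the expiry timestamp only. It performs no signature
    verification and is not a substitute for the full TUF client workflow.
    """
    now = now or datetime.now(timezone.utc)
    signed = json.loads(raw)["signed"]
    expires = datetime.strptime(
        signed["expires"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    if expires <= now:
        # no network fallback: someone else is responsible for refreshing
        raise StaleMetadataError(f"{signed.get('_type')} expired at {expires}")
    return signed
```

A read-only worker would call something like this on the bytes it loads from shared storage and treat the exception as "the writer has fallen behind", without ever contacting the repository itself.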

"dumb read-only mode" or IO abstraction (or both) sound like things that could be added as optional features to ngclient.

  • Abstracting the metadata IO should be straightforward: something still needs to take care of filename encoding, but nothing should be visible to the API user (apart from the added optional argument for StorageBackendInterface or something)
  • Abstracting IO in find_cached_target and download_target should work as well, we'll just need to make sure the optional filepath argument still makes sense -- likely that only makes sense with the default filesystem implementation
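
One possible reading of the first point, sketched under assumptions: the client keeps doing the filename encoding itself and takes an optional storage argument that defaults to the local filesystem. `MetadataCache` and `FilesystemStorage` are hypothetical names used only for illustration, and the `read`/`write` methods are an assumed shape rather than the actual StorageBackendInterface.

```python
import contextlib
import os
import tempfile
from typing import Any, Optional
from urllib import parse


class FilesystemStorage:
    """Default backend: one file per key, written atomically."""

    def __init__(self, directory: str) -> None:
        self._dir = directory

    def read(self, key: str) -> bytes:
        with open(os.path.join(self._dir, key), "rb") as f:
            return f.read()

    def write(self, key: str, data: bytes) -> None:
        fd, tmp_name = tempfile.mkstemp(dir=self._dir)
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
            os.replace(tmp_name, os.path.join(self._dir, key))
        except OSError:
            # remove the temp file if the write or rename failed
            with contextlib.suppress(FileNotFoundError):
                os.remove(tmp_name)
            raise


class MetadataCache:
    """Hypothetical stand-in for the part of the client that persists metadata."""

    def __init__(self, metadata_dir: str, storage: Optional[Any] = None) -> None:
        # storage: any object with read(key) and write(key, data);
        # when omitted, behave as today and use the local directory
        self._storage = storage if storage is not None else FilesystemStorage(metadata_dir)

    @staticmethod
    def _key(rolename: str) -> str:
        # encoding stays inside the client, so backends only ever see safe keys
        return f"{parse.quote(rolename, '')}.json"

    def load(self, rolename: str) -> bytes:
        return self._storage.read(self._key(rolename))

    def persist(self, rolename: str, data: bytes) -> None:
        self._storage.write(self._key(rolename), data)
```

Existing callers that pass only a directory would see no change, while a deployment with shared storage could pass its own backend object.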

@woodruffw
Contributor Author

> There is a related question to consider -- in a scenario where you have "distributed workers", maybe what you really want is a bunch of "read-only" workers that operate without ever connecting to the repository (at least for metadata), and one writing tuf client that actually does the updates at regular intervals

Thanks for extrapolating this! This is indeed the underlying scenario, and probably is a more accurate encapsulation of what I actually need 🙂
