Add skopeo-based adapter for working with OCI images #277
base: master
Conversation
These parts will be useful for the upcoming skopeo adapter as well.
"get" is probably clearer than "list", and tacking on "_ids" makes it clearer what the return value is. Also, drop the leading underscore, which is a holdover from the function being in the adapters.docker module.
The minimum Python version of DataLad is new enough that we can assume subprocess.run() is available. It's recommended by the docs, and I like it more, so switch to it. Note that we might want to eventually switch to using WitlessRunner here. The original idea with using the subprocess module directly was that it'd be nice for the docker adapter to be standalone, as nothing in the adapter depended on datalad at the time. That's not the case anymore after the adapters.utils split and the use of datalad.utils within it. (And the upcoming skopeo adapter will make heavier use of datalad for adding URLs to the layers.)
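For illustration, a call in the style this commit describes (a minimal sketch; the helper name and arguments are made up here, not the adapter's actual code):

```python
import subprocess

def save_image(image_id, path):
    # Hypothetical helper: subprocess.run() with check=True raises
    # CalledProcessError on failure, and the captured stderr can later
    # be surfaced to the caller (see the stderr-related commit below).
    return subprocess.run(
        ["docker", "save", "-o", path, image_id],
        check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
```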
This logic will get a bit more involved in the next commit, and it will be needed by the skopeo adapter too.
When the adapter is called from the command line (as containers-run does) and datalad gets imported, the level set via the --verbose argument doesn't have an effect and logging happens twice, once through datalad's handler and once through the adapter's. Before 313c4f0 (WIN/Workaround: don't pass gid and uid to docker run call, 2020-11-10), the above was the case when docker.main() was triggered with the documented `python -m datalad_container.adapters ...` invocation, but not when the script path was passed to python. Following that commit, the adapter imports datalad, so datalad's logger is always configured. Adjust setup_logger() to set the log level of loggers under the datalad.containers.adapters namespace so that the adapter's logging level is in effect for command line calls to the adapter. As mentioned above, datalad is now loaded in all cases, so a handler is always configured, but, in case this changes in the future, add a simpler handler if one isn't already configured.
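A rough sketch of the adjusted setup, assuming the datalad.containers.adapters namespace mentioned above (not the actual implementation):

```python
import logging

def setup_logger(level):
    # Set the level on the adapters' namespace so it takes effect even
    # when datalad's own handler is already configured ...
    lgr = logging.getLogger("datalad.containers.adapters")
    lgr.setLevel(level)
    # ... and only attach a simple fallback handler if none exists yet.
    if not logging.getLogger("datalad").hasHandlers():
        lgr.addHandler(logging.StreamHandler())
    return lgr
```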
The same handling will be needed in the skopeo adapter. Avoid repeating it.
Some of the subprocess calls capture stderr. Show it to the caller on failure.
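Roughly, the idea is (illustrative sketch, not the adapter's code):

```python
import subprocess
import sys

def run_command(cmd):
    try:
        return subprocess.run(cmd, check=True,
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as exc:
        # Re-surface the captured stderr instead of swallowing it.
        if exc.stderr:
            sys.stderr.write(exc.stderr.decode())
        raise
```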
In order to be able to track Docker containers in a dataset, we introduced the docker-save-based docker adapter in 68a1462 (Add prototype of a Docker adapter, 2018-05-18). It's not clear how much this has been used, but at least conceptually it seems to be viable. One problem, however, is that ideally we'd be able to assign Docker registry URLs to the image files stored in the dataset (particularly the large non-configuration files). There doesn't seem to be a way to do this with the docker-save archives.

Another option for storing the image in a dataset is the Open Container Initiative image format. Skopeo can be used to copy images from Docker registries (and some other sources) to an OCI-compliant directory. When Docker Hub is used as the source, the resulting layer blobs can be re-obtained via GET /v2/NAME/blobs/ID. Using skopeo/OCI also has the advantage of making it easier to execute via podman in the future.

Add an initial skopeo-based OCI adapter. At this point, it has the same functionality as the docker adapter.
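As an illustration of the registry URL pattern mentioned above (the helper name is an assumption; registry-1.docker.io is Docker Hub's registry API host):

```python
def layer_blob_url(name, digest, registry="registry-1.docker.io"):
    # A layer blob stored in the OCI directory could be re-obtained
    # from the registry via GET /v2/NAME/blobs/ID.
    return f"https://{registry}/v2/{name}/blobs/{digest}"

# layer_blob_url("library/busybox", "sha256:abc123...")
# -> "https://registry-1.docker.io/v2/library/busybox/blobs/sha256:abc123..."
```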
After running `skopeo copy docker://docker.io/... oci:<dir>`, we can
link up the layer to the Docker registry. However, other digests
aren't preserved. One notable mismatch is between the image ID if you
run
docker pull x
versus
skopeo copy docker://x oci:x && skopeo copy oci:x docker-daemon:x
I haven't really wrapped my head around all the different digests and
when they can change. However, skopeo's issue tracker has a good deal
of discussion about this, and it looks complicated (e.g., issues 11,
469, 949, 1046, and 1097).
The adapter docstring should probably note this, though at this point
I'm not sure I could say something coherent. Anyway, add a to-do
note...
I _think_ containers-storage: is what we'd use for podman-run, but I haven't attempted it.
Prevent skopeo-copy output from being shown, since it's probably confusing to see output under run's "Command start (output follows)" tag for a command that the user didn't explicitly call. However, for large images, this has the downside that the user might want some signs of life, so this may need to be revisited.
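A minimal sketch of suppressing that output (the exact redirection the adapter uses may differ):

```python
import subprocess

def copy_quietly(src, dest):
    # Keep skopeo's progress output away from the
    # "Command start (output follows)" section of a run record.
    subprocess.run(["skopeo", "copy", src, dest],
                   check=True,
                   stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL)
```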
We'll need this information in order to add a tag to the oci: destination and to make the entry copied to docker-daemon more informative. I've tried to base the rules on containers/image implementation, which is what skopeo uses underneath.
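A rough sketch of such splitting rules, loosely following containers/image conventions (the function name and exact behavior are assumptions, not the adapter's parser):

```python
def parse_docker_reference(ref):
    # Strip the transport prefix, then split off an optional @digest or
    # :tag from the repository name.  The real rules (default registry,
    # "library/" namespace, ports in the registry host) are more involved.
    if ref.startswith("docker://"):
        ref = ref[len("docker://"):]
    digest = tag = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)
    elif ":" in ref.rsplit("/", 1)[-1]:
        ref, tag = ref.rsplit(":", 1)
    return {"name": ref, "tag": tag, "digest": digest}

# parse_docker_reference("docker://docker.io/library/debian:buster-slim")
# -> {"name": "docker.io/library/debian", "tag": "buster-slim", "digest": None}
```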
An image stored as an OCI directory can have a tag. If the source has a tag specified, copy it over to the destination. Note that upcoming commits will store the full source specification as an image annotation, so we won't rely on this when copying the image to docker-daemon:, but it still seems nice to have (e.g., when looking at the directory with skopeo-inspect).
These will be used to store the value of the skopeo-copy source and then retrieve it at load time to make the docker-daemon: entry more informative.
The OCI format allows annotations. Add one with the source value (which will be determined by what the caller gives to containers-add) so that we can use this information when copying the information to a docker-daemon: destination.
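In an OCI image layout, annotations live on the manifest descriptors in index.json; here is a sketch of adding one (the annotation key is a placeholder, not necessarily the one the adapter uses):

```python
import json
from pathlib import Path

def annotate_source(oci_dir, source):
    index = Path(oci_dir, "index.json")
    layout = json.loads(index.read_text())
    # Attach the containers-add source to the (single) manifest entry.
    manifest = layout["manifests"][0]
    manifest.setdefault("annotations", {})["org.example.source"] = source
    index.write_text(json.dumps(layout, indent=2))
```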
The images copied to the daemon look like this
$ docker images
REPOSITORY             TAG              IMAGE ID       CREATED       SIZE
datalad-container/bb   sha256-98345e4   98345e418eb7   3 weeks ago   69.2MB
That tag isn't useful because it just repeats the image ID. And the
name after "datalad-container/" is the name of the directory, so with
the default containers-add location it would be an uninformative
"image".
With the last commit, we store the source specification as an
annotation in the OCI directory. Parse it and reuse the original
repository name and tag.
REPOSITORY                 TAG           IMAGE ID       CREATED       SIZE
datalad-container/debian   buster-slim   98345e418eb7   3 weeks ago   69.2MB
If the source has a digest instead of the tag, construct the daemon
tag from that.
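Something along these lines (a sketch only; the actual naming rule may differ):

```python
def daemon_name(repository, tag=None, digest=None):
    # Prefer the source tag; otherwise derive one from the digest,
    # since ":" is not allowed inside a tag.
    suffix = tag if tag else digest.replace(":", "-")
    return f"{repository}:{suffix}"

# daemon_name("debian", tag="buster-slim")          -> "debian:buster-slim"
# daemon_name("debian", digest="sha256:98345e...")  -> "debian:sha256-98345e..."
```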
Add a new oci: scheme. The stacking of the schemes isn't ideal (oci:docker://, oci:docker-daemon:), but it allows for any skopeo transport to be used. Note: I'm not avoiding appending "//" for a conceptual reason (although there might be a valid one), but because I find "oci://docker://" to be ugly. Perhaps the consistency with "shub://" and "dhub://" outweighs that though.
The next commit will use this logic in the oci adapter as well, and it'd be nice (though not strictly necessary) to avoid oci and containers_add importing each other.
TODO: Finalize approach in Datalad for Docker Registry URLs.
* origin/master: (217 commits)
  [DATALAD RUNCMD] Run pre-commit to harmonize code throughout
  Update __version__ to 1.2.6
  [skip ci] Update CHANGELOG
  BF: use setuptools.errors.OptionError instead of now removed import of distutils.DistutilsOptionError
  BF: docbuild - use python 3.9 (not 3.8) and upgrade setuptools
  [DATALAD RUNCMD] Run pre-commit to harmonize code throughout
  rm duplicate .codespellrc and move some of its skips into pyproject.toml
  progress codespell in pre-commit
  Add precommit configuration as in datalad ATM
  [release-action] Autogenerate changelog snippet for PR 268
  MNT: Account for a number of deprecations in core
  Revert linting a target return value for a container
  Fix lint errors other than line length
  upper case CWD acronym
  CI/tools: Add fuse2fs dependency for singularity installation
  Improving documentation for --url parameter
  Update __version__ to 1.2.5
  [skip ci] Update CHANGELOG
  Add changelog entry for isort PR
  [DATALAD RUNCMD] isort all files for consistency
  ...

Conflicts - some were tricky:
  datalad_container/adapters/docker.py
  datalad_container/containers_add.py
  datalad_container/utils.py - both added but merge looked funny
otherwise even singularity does not install
=== Do not change lines below ===
{
"chain": [],
"cmd": "sed -i -e 's,from distutils.spawn import find_executable,from shutil import which,g' -e 's,find_executable(,which(,g' datalad_container/adapters/tests/test_oci_more.py",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
Codecov Report

@@            Coverage Diff             @@
##           master     #277       +/-   ##
===========================================
- Coverage   94.60%   83.65%   -10.95%
===========================================
  Files          24       28        +4
  Lines        1112     1444      +332
===========================================
+ Hits         1052     1208      +156
- Misses         60      236      +176
I guess it is too much of a crippled system. I will move those commits into a separate PR; no point in occluding things here.
Added comprehensive documentation for Claude Code to work effectively with this codebase, including architecture overview, development commands, and key implementation details. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Extended the OCI adapter to support any container registry without
hardcoding endpoints. The link() function now dynamically constructs
registry API endpoints using the pattern https://{registry}/v2/, with
Docker Hub as the only special case (registry-1.docker.io).
This enables automatic support for registries like:
- quay.io (Quay.io registry)
- gcr.io (Google Container Registry)
- ghcr.io (GitHub Container Registry)
- Any other V2-compatible registry
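A minimal sketch of that construction, assuming the Docker Hub special-casing described above (not necessarily the exact code in link()):

```python
def registry_endpoint(registry):
    # Docker Hub's API lives on registry-1.docker.io; every other
    # V2-compatible registry is addressed directly.
    if registry == "docker.io":
        registry = "registry-1.docker.io"
    return f"https://{registry}/v2/"

# registry_endpoint("ghcr.io")   -> "https://ghcr.io/v2/"
# registry_endpoint("docker.io") -> "https://registry-1.docker.io/v2/"
```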
Changes:
- Removed hardcoded _ENDPOINTS dictionary
- Added dynamic endpoint construction in link() function
- Added unit tests for parsing references from alternative registries
- Added integration tests using real images:
- ghcr.io/astral-sh/uv:latest for ghcr.io testing
- quay.io/linuxserver.io/baseimage-alpine:3.18 for quay.io testing
The link() function will add registry URLs to annexed layer images for
any registry when proper provider configuration is available, enabling
efficient retrieval through git-annex.
All new tests are marked with @pytest.mark.ai_generated as per project
standards.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Enhanced the parametrized registry test to include:

1. Docker Hub (docker.io) with busybox:1.30 for consistency
2. Verification that annexed blobs exist in the OCI image
3. Check that all annexed files have URLs registered in either the datalad or web remote for efficient retrieval

The test now verifies that `git annex find --not --in datalad --and --not --in web` returns empty, ensuring all blobs are accessible through git-annex remotes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Enhanced the parametrized registry test to verify the complete drop/get cycle for the entire dataset:

1. Drops all annexed content in the dataset
2. Verifies that files were actually dropped (non-empty results)
3. Gets everything back from remotes
4. Verifies that files were retrieved (non-empty results)

This ensures that the registered URLs in datalad/web remotes are functional and files can be successfully retrieved from the registry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
This fixture ensures that sys.executable's directory is first in PATH for the duration of tests. This is needed when tests spawn subprocesses that need to import modules from the same Python environment that's running pytest, preventing "No module named X" errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
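A sketch of such a fixture (the name and autouse behavior are assumptions):

```python
import os
import sys
import pytest

@pytest.fixture(autouse=True)
def python_bin_first_in_path(monkeypatch):
    # Put sys.executable's directory at the front of PATH so spawned
    # subprocesses resolve the same Python environment running pytest.
    bindir = os.path.dirname(sys.executable)
    monkeypatch.setenv("PATH",
                       bindir + os.pathsep + os.environ.get("PATH", ""))
```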
…er handling

- Add parametrized integration test covering docker.io, gcr.io, and quay.io
- Test container addition, execution, and annexed blob verification
- Add drop/get cycle testing to verify remote retrieval works
- Fix link() to create datalad remote even without provider configuration
- Issue warning instead of skipping when provider not found
- Allows URLs to be registered and files to be retrieved from any registry
- Use pytest tmp_path fixture instead of @with_tempfile decorator

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
@asmacdo, try to use it with containers which might be of interest to you.
Reincarnated Kyle's
since I want (almost need) to store OCI images and then generate SIF files from them.
Observations
- tested, and it seems to work just fine with our added warning here, although we might even want to get rid of it

TODOs:
- `+_ENDPOINTS = {"docker.io": "https://registry-1.docker.io/v2/"}` but should be generic and skopeo should know...
- `datalad` special remote. For now enabled the datalad special remote for all
The following Python code shows how to check whether a registry requires an auth token:
```python
import requests
import json
from urllib.parse import urlparse, parse_qs
from typing import Optional, Dict


class AuthenticatedBlobDownloader:
    """Download OCI blobs with proper registry authentication"""

    def __init__(self):
        self.session = requests.Session()
        self.token_cache = {}  # Cache tokens per registry/repo

    def get_auth_token(self, registry: str, repository: str,
                       username: Optional[str] = None,
                       password: Optional[str] = None) -> Optional[str]:
        """
        Get authentication token for a registry using the OAuth2 flow.

        Returns token for anonymous access to public repos, or
        authenticated access.
        """
        cache_key = f"{registry}/{repository}"
        if cache_key in self.token_cache:
            return self.token_cache[cache_key]

        # Step 1: Probe the /v2/ endpoint to get WWW-Authenticate header
        probe_url = f"https://{registry}/v2/"
        response = self.session.get(probe_url)

        if response.status_code == 200:
            # No auth required
            return None

        if response.status_code != 401:
            raise ValueError(f"Unexpected response from registry: {response.status_code}")

        # Step 2: Parse WWW-Authenticate header
        www_auth = response.headers.get('WWW-Authenticate', '')
        if not www_auth.startswith('Bearer'):
            raise ValueError(f"Unsupported auth scheme: {www_auth}")

        # Parse realm, service, scope from header
        # Example: Bearer realm="https://ghcr.io/token",service="ghcr.io",scope="repository:user/repo:pull"
        auth_params = {}
        for part in www_auth.replace('Bearer ', '').split(','):
            if '=' in part:
                key, value = part.split('=', 1)
                auth_params[key.strip()] = value.strip('"')

        realm = auth_params.get('realm')
        service = auth_params.get('service')

        if not realm:
            raise ValueError(f"No realm in WWW-Authenticate: {www_auth}")

        # Step 3: Request token from auth endpoint
        token_params = {
            'service': service,
            'scope': f'repository:{repository}:pull'
        }

        if username and password:
            # Authenticated request
            token_response = self.session.get(
                realm,
                params=token_params,
                auth=(username, password)
            )
        else:
            # Anonymous request (works for public repos)
            token_response = self.session.get(realm, params=token_params)

        token_response.raise_for_status()
        token_data = token_response.json()
        token = token_data.get('token') or token_data.get('access_token')

        if not token:
            raise ValueError(f"No token in response: {token_data}")

        # Cache the token
        self.token_cache[cache_key] = token
        return token

    def download_blob(self, blob_url: str, output_path: str,
                      username: Optional[str] = None,
                      password: Optional[str] = None) -> str:
        """
        Download a blob from a registry with proper authentication.

        Args:
            blob_url: Full URL like https://ghcr.io/v2/repo/blobs/sha256:...
            output_path: Where to save the downloaded blob
            username: Optional username for private repos
            password: Optional password/token for private repos

        Returns:
            Path to downloaded file
        """
        # Parse the blob URL
        parsed = urlparse(blob_url)
        registry = parsed.netloc

        # Extract repository from path: /v2/REPO/blobs/DIGEST
        path_parts = parsed.path.split('/')
        if len(path_parts) < 5 or path_parts[1] != 'v2' or path_parts[-2] != 'blobs':
            raise ValueError(f"Invalid blob URL format: {blob_url}")

        # Repository is everything between /v2/ and /blobs/
        repository = '/'.join(path_parts[2:-2])
        digest = path_parts[-1]

        print(f"Downloading from {registry}/{repository}")
        print(f"  Digest: {digest}")

        # Get authentication token
        token = self.get_auth_token(registry, repository, username, password)

        # Download the blob
        headers = {}
        if token:
            headers['Authorization'] = f'Bearer {token}'
            print(f"  Using auth token: {token[:30]}...")
        else:
            print("  No authentication required")

        response = self.session.get(blob_url, headers=headers, stream=True)
        response.raise_for_status()

        # Save to file
        with open(output_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)

        print(f"  ✓ Downloaded to {output_path}")
        return output_path


# Usage example
if __name__ == "__main__":
    downloader = AuthenticatedBlobDownloader()

    # Example 1: Download from ghcr.io (public repo)
    blob_url = "https://ghcr.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf"
    downloader.download_blob(blob_url, "layer.tar.gz")

    # Example 2: Download from a private repo (with credentials)
    # downloader.download_blob(
    #     "https://ghcr.io/v2/myorg/private-repo/blobs/sha256:...",
    #     "private-layer.tar.gz",
    #     username="myuser",
    #     password="ghp_myPersonalAccessToken"
    # )
```

There is also Python code to potentially get endpoints programmatically, although it still needs an ad-hoc fix for Docker Hub.