@yarikoptic yarikoptic commented Sep 27, 2025

Reincarnated Kyle's work, since I want (almost need) to store OCI images and then generate SIF files from them.

Observations

  • works well with an image pointing to a subdataset

TODOs:

  • review addition of URLs, since it seems they are not added for e.g. oci:docker://quay.io/singularity/singularity:v3.9.0-slim; ATM this relies on the hardcoded _ENDPOINTS = {"docker.io": "https://registry-1.docker.io/v2/"} but should be generic, and skopeo should know...
  • some registries would require Bearer token auth. We have some implementation in stock datalad, but with a warning. The point is that we might need to add them so they belong to the datalad special remote. For now, the datalad special remote is enabled for all.
  • TODO (later/maybe): need to add datalad special remote providers for other registries, since they all need a bearer token. Since that is not in stock datalad ATM, I guess we had better do it dynamically at run time somehow (see the provider-file sketch after the download-url transcript below).

Tested, and it seems to work just fine with the warning we added here, although we might even want to get rid of it:
❯ datalad containers-add --url oci:docker://gcr.io/google-containers/busybox:latest test-2

Getting image source signatures
Copying blob a3ed95caeb02 done  
Copying blob a3ed95caeb02 done  
Copying blob 138cfc514ce4 done  
Copying blob a3ed95caeb02 skipped: already exists  
Copying config a8abf0c769 done   | 
Writing manifest to image destination
add(ok): .datalad/environments/test-2/image/blobs/sha256/138cfc514ce4b3f1f8d57b2f9766fcb5ffab791110bcd8610e8d762cc78d28b2 (file)                                                                                               
add(ok): .datalad/environments/test-2/image/blobs/sha256/a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 (file)                                                                                               
add(ok): .datalad/environments/test-2/image/blobs/sha256/a8abf0c7690539c07cf95d28ec8e66922288dc26869569304a2ba3a2d5a78540 (file)                                                                                               
add(ok): .datalad/environments/test-2/image/blobs/sha256/d2b1a07a7a73df9fcab3cedafd00fd548345f58efa52680fab80b612d061b534 (file)                                                                                               
add(ok): .datalad/environments/test-2/image/index.json (file)                                                                                                                                                                  
add(ok): .datalad/environments/test-2/image/oci-layout (file)                                                                                                                                                                  
add(ok): .datalad/config (file)                                                                                                                                                                                                
save(ok): . (dataset)                                                                                                                                                                                                          
action summary:                                                                                                                                                                                                                
  add (ok: 7)
  save (ok: 1)
add(ok): .datalad/environments/test-2/image/blobs/sha256/138cfc514ce4b3f1f8d57b2f9766fcb5ffab791110bcd8610e8d762cc78d28b2 (file)
add(ok): .datalad/environments/test-2/image/blobs/sha256/a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 (file)
add(ok): .datalad/environments/test-2/image/blobs/sha256/a8abf0c7690539c07cf95d28ec8e66922288dc26869569304a2ba3a2d5a78540 (file)
add(ok): .datalad/environments/test-2/image/blobs/sha256/d2b1a07a7a73df9fcab3cedafd00fd548345f58efa52680fab80b612d061b534 (file)
add(ok): .datalad/environments/test-2/image/index.json (file)
add(ok): .datalad/environments/test-2/image/oci-layout (file)
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /tmp/test-oci/.datalad/environments/test-2/image (file)
[WARNING] Required Datalad provider configuration for Docker registry links not detected. We will enable 'datalad' special remote anyways but datalad might issue warnings later on. 
action summary:
  add (ok: 7)
  containers_add (ok: 1)
  save (ok: 1)
❯ git annex drop --all
drop MD5E-s1142686--ced8b461027cb2e2ee11c9ad670b749b ok
drop MD5E-s32--54a01009f17bdb7ec1dd1cb427244304 ok
drop MD5E-s783033--f1a49b4fd6d4fce7dbac8c4694672706 ok
drop MD5E-s480923--008740f932de66855019ca24fbceef1b ok
(recording state in git...)
❯ datalad get -J5 .
get(ok): .datalad/environments/test-2/image/blobs/sha256/a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 (file) [from datalad...]                                                                             
get(ok): .datalad/environments/test-1/image/blobs/sha256/1617e25568b2231fdd0d5caff63b06f6f7738d8d961f031c80e47d35aaec9733 (file) [from datalad...]                                                                             
get(ok): .datalad/environments/test-1/image/blobs/sha256/9fa9226be034e47923c0457d916aa68474cdfb23af8d4525e9baeebc4760977a (file) [from datalad...]
get(ok): .datalad/environments/test-2/image/blobs/sha256/138cfc514ce4b3f1f8d57b2f9766fcb5ffab791110bcd8610e8d762cc78d28b2 (file) [from datalad...]
action summary:
  get (ok: 4)
The following Python code shows how to check whether a registry requires an auth token:
import requests
from urllib.parse import urlparse
from typing import Optional

class AuthenticatedBlobDownloader:
    """Download OCI blobs with proper registry authentication"""
    
    def __init__(self):
        self.session = requests.Session()
        self.token_cache = {}  # Cache tokens per registry/repo
    
    def get_auth_token(self, registry: str, repository: str, 
                       username: Optional[str] = None, 
                       password: Optional[str] = None) -> Optional[str]:
        """
        Get authentication token for a registry using the OAuth2 flow.
        Returns token for anonymous access to public repos, or authenticated access.
        """
        cache_key = f"{registry}/{repository}"
        if cache_key in self.token_cache:
            return self.token_cache[cache_key]
        
        # Step 1: Probe the /v2/ endpoint to get WWW-Authenticate header
        probe_url = f"https://{registry}/v2/"
        response = self.session.get(probe_url)
        
        if response.status_code == 200:
            # No auth required
            return None
        
        if response.status_code != 401:
            raise ValueError(f"Unexpected response from registry: {response.status_code}")
        
        # Step 2: Parse WWW-Authenticate header
        www_auth = response.headers.get('WWW-Authenticate', '')
        if not www_auth.startswith('Bearer'):
            raise ValueError(f"Unsupported auth scheme: {www_auth}")
        
        # Parse realm, service, scope from header
        # Example: Bearer realm="https://ghcr.io/token",service="ghcr.io",scope="repository:user/repo:pull"
        auth_params = {}
        for part in www_auth.replace('Bearer ', '').split(','):
            if '=' in part:
                key, value = part.split('=', 1)
                auth_params[key.strip()] = value.strip('"')
        
        realm = auth_params.get('realm')
        service = auth_params.get('service')
        
        if not realm:
            raise ValueError(f"No realm in WWW-Authenticate: {www_auth}")
        
        # Step 3: Request token from auth endpoint
        token_params = {
            'service': service,
            'scope': f'repository:{repository}:pull'
        }
        
        if username and password:
            # Authenticated request
            token_response = self.session.get(
                realm, 
                params=token_params,
                auth=(username, password)
            )
        else:
            # Anonymous request (works for public repos)
            token_response = self.session.get(realm, params=token_params)
        
        token_response.raise_for_status()
        token_data = token_response.json()
        
        token = token_data.get('token') or token_data.get('access_token')
        if not token:
            raise ValueError(f"No token in response: {token_data}")
        
        # Cache the token
        self.token_cache[cache_key] = token
        return token
    
    def download_blob(self, blob_url: str, output_path: str,
                     username: Optional[str] = None,
                     password: Optional[str] = None) -> str:
        """
        Download a blob from a registry with proper authentication.
        
        Args:
            blob_url: Full URL like https://ghcr.io/v2/repo/blobs/sha256:...
            output_path: Where to save the downloaded blob
            username: Optional username for private repos
            password: Optional password/token for private repos
        
        Returns:
            Path to downloaded file
        """
        # Parse the blob URL
        parsed = urlparse(blob_url)
        registry = parsed.netloc
        
        # Extract repository from path: /v2/REPO/blobs/DIGEST
        path_parts = parsed.path.split('/')
        if len(path_parts) < 5 or path_parts[1] != 'v2' or path_parts[-2] != 'blobs':
            raise ValueError(f"Invalid blob URL format: {blob_url}")
        
        # Repository is everything between /v2/ and /blobs/
        repository = '/'.join(path_parts[2:-2])
        digest = path_parts[-1]
        
        print(f"Downloading from {registry}/{repository}")
        print(f"  Digest: {digest}")
        
        # Get authentication token
        token = self.get_auth_token(registry, repository, username, password)
        
        # Download the blob
        headers = {}
        if token:
            headers['Authorization'] = f'Bearer {token}'
            print(f"  Using auth token: {token[:30]}...")
        else:
            print("  No authentication required")
        
        response = self.session.get(blob_url, headers=headers, stream=True)
        response.raise_for_status()
        
        # Save to file
        with open(output_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
        
        print(f"  ✓ Downloaded to {output_path}")
        return output_path


# Usage example
if __name__ == "__main__":
    downloader = AuthenticatedBlobDownloader()
    
    # Example 1: Download from ghcr.io (public repo)
    blob_url = "https://ghcr.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf"
    downloader.download_blob(blob_url, "layer.tar.gz")
    
    # Example 2: Download from a private repo (with credentials)
    # downloader.download_blob(
    #     "https://ghcr.io/v2/myorg/private-repo/blobs/sha256:...",
    #     "private-layer.tar.gz",
    #     username="myuser",
    #     password="ghp_myPersonalAccessToken"
    # )
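For reference, the same OAuth2 token flow can be exercised directly. A minimal sketch against Docker Hub's well-known token endpoint (auth.docker.io with service registry.docker.io), using library/busybox as an illustrative public repository:

import requests

# Anonymous token request for a public repository (illustrative values)
resp = requests.get(
    "https://auth.docker.io/token",
    params={
        "service": "registry.docker.io",
        "scope": "repository:library/busybox:pull",
    },
)
resp.raise_for_status()
token = resp.json()["token"]
print(f"anonymous pull token: {token[:30]}...")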
Python code to potentially get the endpoints programmatically, although it still needs an ad-hoc fix for Docker Hub:
import subprocess
import json

def get_layer_urls_from_skopeo(image_ref: str) -> dict:
    """
    Extract layer URLs with ZERO hardcoding.
    Trust skopeo's normalized Name field as the source of truth.
    """
    if not image_ref.startswith("docker://"):
        image_ref = f"docker://{image_ref}"
    
    # Let skopeo do all the work
    result = subprocess.run(
        ["skopeo", "inspect", image_ref],
        capture_output=True, text=True, check=True
    )
    data = json.loads(result.stdout)
    
    # Skopeo's Name field is the authoritative source
    name = data["Name"]
    parts = name.split('/', 1)
    registry = parts[0]
    repository = parts[1] if len(parts) > 1 else ""
    
    # THE KEY INSIGHT: For most registries, registry hostname = API endpoint
    # The ONLY common exception is docker.io -> registry-1.docker.io
    # But we can detect this by checking if skopeo used docker.io or index.docker.io
    
    # If it's docker.io or index.docker.io, the API is at registry-1.docker.io
    # This is the ONE hardcoded fact about Docker Hub's architecture
    if registry in ["docker.io", "index.docker.io"]:
        api_host = "registry-1.docker.io"
    else:
        # For ALL other registries: hostname = API endpoint
        api_host = registry
    
    provenance = {
        "name": name,
        "registry": registry,
        "repository": repository,
        "api_endpoint": api_host,
        "digest": data["Digest"],
        "layers": [
            {
                "digest": layer["Digest"],
                "size": layer["Size"],
                "url": f"https://{api_host}/v2/{repository}/blobs/{layer['Digest']}"
            }
            for layer in data.get("LayersData", [])
        ]
    }
    
    return provenance


# Test
for img in [
    "ghcr.io/con/nwb2bids:v0.5.0",
    "quay.io/singularity/singularity:v3.9.0-slim",
    "ubuntu:latest"
]:
    print(f"\n{'='*60}\n{img}")
    prov = get_layer_urls_from_skopeo(img)
    print(f"Registry: {prov['registry']}")
    print(f"API Endpoint: {prov['api_endpoint']}")
    print(f"Blob URL: {prov['layers'][0]['url']}")
❯ datalad download-url https://ghcr.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf
[INFO   ] Downloading 'https://ghcr.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf' into '/tmp/' 
Access to https://ghcr.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf has failed.
Would you like to setup a new provider configuration to access url? (choices: [yes], no): yes

New provider name
Unique name to identify 'provider' for https://ghcr.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf [ghcr.io]: 

New provider regular expression
A (Python) regular expression to specify for which URLs this provider should be used [https://ghcr\.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf]: https://ghcr\.io/v2/.*

Authentication type
What authentication type to use (choices: aws-s3, bearer_token, bearer_token_anon, html_form, http_auth, http_basic_auth, http_digest_auth, http_token, loris-token, nda-s3, none, xnat): bearer_token_anon 

Credential
What type of credential should be used? (choices: aws-s3, git, loris-token, nda-s3, [token], user_password): 

Save provider configuration file
Following configuration will be written to /home/yoh/.config/datalad/providers/ghcr.io.cfg:
# Provider configuration file created to initially access
# https://ghcr.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf

[provider:ghcr.io]
url_re = https://ghcr\.io/v2/.*
authentication_type = bearer_token_anon
# Note that you might need to specify additional fields specific to the
# authenticator.  Fow now "look into the docs/source" of <class 'datalad.downloaders.http.HTTPAnonBearerTokenAuthenticator'>
# bearer_token_anon_
credential = ghcr.io

[credential:ghcr.io]
# If known, specify URL or email to how/where to request credentials
# url = ???
type = token
 (choices: [yes], no): yes

You need to authenticate with 'ghcr.io' credentials.
token: 
[WARNING] Argument 'credential' specified, but it will be ignored: Token(auth_url=<<'https://ghcr.++93 chars++cdf'>>, ds=None, name='ghcr.io', url=None) 
download_url(ok): /tmp/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf (file)                                                                                                                          
❯ file sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf
sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf: gzip compressed data, was "rootfs.tar", max compression, from Unix, original size modulo 2^32 81039360
❯ rm sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf
❯ datalad download-url https://ghcr.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf
[INFO   ] Downloading 'https://ghcr.io/v2/con/nwb2bids/blobs/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf' into '/tmp/' 
[WARNING] Argument 'credential' specified, but it will be ignored: Token(auth_url=<<'https://ghcr.++93 chars++cdf'>>, ds=None, name='ghcr.io', url=None) 
download_url(ok): /tmp/sha256:8c7716127147648c1751940b9709b6325f2256290d3201662eca2701cadb2cdf (file)      
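As the run-time provider TODO at the top suggests, the interactive setup above could be scripted. A minimal sketch that writes the same provider file non-interactively, assuming the .cfg format and default per-user location shown in the transcript:

from pathlib import Path

# Same content datalad wrote interactively above (assumption: the default
# per-user provider location and the .cfg format shown in the log)
PROVIDER_CFG = """\
[provider:ghcr.io]
url_re = https://ghcr\\.io/v2/.*
authentication_type = bearer_token_anon
credential = ghcr.io

[credential:ghcr.io]
type = token
"""

cfg_path = Path.home() / ".config" / "datalad" / "providers" / "ghcr.io.cfg"
cfg_path.parent.mkdir(parents=True, exist_ok=True)
cfg_path.write_text(PROVIDER_CFG)
print(f"Wrote {cfg_path}")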

kyleam and others added 24 commits December 4, 2020 16:10
These parts will be useful for the upcoming skopeo adapter as well.
"get" is probably clearer than "list", and tacking on "_ids" makes it
clearer what the return value is.

Also, drop the leading underscore, which is a holdover from the
function being in the adapters.docker module.
The minimum Python version of DataLad is new enough that we can assume
subprocess.run() is available.  It's recommended by the docs, and I
like it more, so switch to it.

Note that we might want to eventually switch to using WitlessRunner
here.  The original idea with using the subprocess module directly was
that it'd be nice for the docker adapter to be standalone, as nothing
in the adapter depended on datalad at the time.  That's not the case
anymore after the adapters.utils split and the use of datalad.utils
within it.  (And the upcoming skopeo adapter will make heavier use of
datalad for adding URLs to the layers.)
This logic will get a bit more involved in the next commit, and it
will be needed by the skopeo adapter too.
When the adapter is called from the command line (as containers-run
does) and datalad gets imported, the level set via the --verbose
argument doesn't have an effect and logging happens twice, once
through datalad's handler and once through the adapter's.

Before 313c4f0 (WIN/Workaround: don't pass gid and uid to docker run
call, 2020-11-10), the above was the case when docker.main() was
triggered with the documented `python -m datalad_container.adapters
...` invocation, but not when the script path was passed to python.
Following that commit, the adapter imports datalad, so datalad's
logger is always configured.

Adjust setup_logger() to set the log level of loggers under the
datalad.containers.adapters namespace so that the adapter's logging
level is in effect for command line calls to the adapter.

As mentioned above, datalad is now loaded in all cases, so a handler
is always configured, but, in case this changes in the future, add a
simpler handler if one isn't already configured.
The same handling will be needed in the skopeo adapter.  Avoid
repeating it.
Some of the subprocess calls capture stderr.  Show it to the caller on
failure.
In order to be able to track Docker containers in a dataset, we
introduced the docker-save-based docker adapter in 68a1462 (Add
prototype of a Docker adapter, 2018-05-18).  It's not clear how much
this has been used, but at least conceptually it seems to be viable.
One problem, however, is that ideally we'd be able to assign Docker
registry URLs to the image files stored in the dataset (particularly
the large non-configuration files).  There doesn't seem to be a way to
do this with the docker-save archives.

Another option for storing the image in a dataset is the Open
Container Initiative image format.  Skopeo can be used to copy images
in Docker registries (and some other destinations) to an OCI-compliant
directory.  When Docker Hub is used as the source, the resulting
layer blobs can be re-obtained via GET /v2/NAME/blobs/ID.

Using skopeo/OCI also has the advantage of making it easier to execute
via podman in the future.

Add an initial skopeo-based OCI adapter.  At this point, it has the
same functionality as the docker adapter.
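For example (illustrative only; <digest> is a placeholder, and registry-1.docker.io is Docker Hub's API host as noted elsewhere in this PR), a layer fetch would look like:

    GET https://registry-1.docker.io/v2/library/busybox/blobs/sha256:<digest>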
After running `skopeo copy docker://docker.io/... oci:<dir>`, we can
link up the layer to the Docker registry.  However, other digests
aren't preserved.  One notable mismatch is between the image ID if you
run

    docker pull x

versus

    skopeo copy docker://x oci:x && skopeo copy oci:x docker-daemon:x

I haven't really wrapped my head around all the different digests and
when they can change.  However, skopeo's issue tracker has a good deal
of discussion about this, and it looks complicated (e.g., issues 11,
469, 949, 1046, and 1097).

The adapter docstring should probably note this, though at this point
I'm not sure I could say something coherent.  Anyway, add a to-do
note...
I _think_ containers-storage: is what we'd use for podman-run, but I
haven't attempted it.
Prevent skopeo-copy output from being shown, since it's probably
confusing to see output under run's "Command start (output follows)"
tag for a command that the user didn't explicitly call.  However, for
large images, this has the downside that the user might want some
signs of life, so this may need to be revisited.
We'll need this information in order to add a tag to the oci:
destination and to make the entry copied to docker-daemon more
informative.  I've tried to base the rules on the containers/image
implementation, which is what skopeo uses underneath.
An image stored as an OCI directory can have a tag.  If the source has
a tag specified, copy it over to the destination.

Note that upcoming commits will store the full source specification
as an image annotation, so we won't rely on this when copying the
image to docker-daemon:, but it still seems nice to have (e.g., when
looking at the directory with skopeo-inspect).
These will be used to store the value of the skopeo-copy source and
then retrieve it at load time to make the docker-daemon: entry more
informative.
The OCI format allows annotations.  Add one with the source value
(which will be determined by what the caller gives to containers-add)
so that we can use this information when copying the information to a
docker-daemon: destination.
The images copied to the daemon look like this

    $ docker images
    REPOSITORY             TAG                 IMAGE ID            CREATED             SIZE
    datalad-container/bb   sha256-98345e4      98345e418eb7        3 weeks ago         69.2MB

That tag isn't useful because it just repeats the image ID.  And the
name after "datalad-container/" is the name of the directory, so with
the default containers-add location it would be an uninformative
"image".

With the last commit, we store the source specification as an
annotation in the OCI directory.  Parse it and reuse the original
repository name and tag.

   REPOSITORY                 TAG                 IMAGE ID            CREATED             SIZE
   datalad-container/debian   buster-slim         98345e418eb7        3 weeks ago         69.2MB

If the source has a digest instead of the tag, construct the daemon
tag from that.
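A hedged sketch of that digest-to-tag construction (hypothetical helper; it mirrors the sha256-98345e4 tag visible in the listing above):

def daemon_tag_from_digest(digest: str) -> str:
    # "sha256:98345e418eb7..." -> "sha256-98345e4"; Docker tags cannot
    # contain ":", so swap it for "-" and abbreviate the hex part
    algo, _, hexdigest = digest.partition(":")
    return f"{algo}-{hexdigest[:7]}"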
Add a new oci: scheme.  The stacking of the schemes isn't ideal
(oci:docker://, oci:docker-daemon:), but it allows for any skopeo
transport to be used.

Note: I'm not avoiding appending "//" for a conceptual reason
(although there might be a valid one), but because I find
"oci://docker://" to be ugly.  Perhaps the consistency with "shub://"
and "dhub://" outweighs that though.
The next commit will use this logic in the oci adapter as well, and
it'd be nice (though not strictly necessary) to avoid oci and
containers_add importing each other.
TODO: Finalize approach in Datalad for Docker Registry URLs.
* origin/master: (217 commits)
  [DATALAD RUNCMD] Run pre-commit to harmonize code throughout
  Update __version__ to 1.2.6
  [skip ci] Update CHANGELOG
  BF: use setuptools.errors.OptionError instead of now removed import of distutils.DistutilsOptionError
  BF: docbuild - use python 3.9 (not 3.8) and upgrade setuptools
  [DATALAD RUNCMD] Run pre-commit to harmonize code throughout
  rm duplicate .codespellrc and move some of its skips into pyproject.toml
  progress codespell in pre-commit
  Add precommit configuration as in datalad ATM
  [release-action] Autogenerate changelog snippet for PR 268
  MNT: Account for a number of deprecations in core
  Revert linting a target return value for a container
  Fix lint errors other than line length
  upper case CWD acronym
  CI/tools: Add fuse2fs dependency for singularity installation
  Improving documentation for --url parameter
  Update __version__ to 1.2.5
  [skip ci] Update CHANGELOG
  Add changelog entry for isort PR
  [DATALAD RUNCMD] isort all files for consistency
  ...

 Conflicts - some were tricky:
	datalad_container/adapters/docker.py
	datalad_container/containers_add.py
	datalad_container/utils.py - both added but merge looked funny
otherwise even singularity does not install
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "sed -i -e 's,from distutils.spawn import find_executable,from shutil import which,g' -e 's,find_executable(,which(,g' datalad_container/adapters/tests/test_oci_more.py",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
@yarikoptic added the CHANGELOG-missing and minor (Increment the minor version when merged) labels on Sep 27, 2025
codecov bot commented Sep 27, 2025

Codecov Report

❌ Patch coverage is 56.78670% with 156 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.65%. Comparing base (89aab60) to head (11bbf45).
⚠️ Report is 14 commits behind head on master.

Files with missing lines                             Patch %   Lines
datalad_container/adapters/oci.py                    42.85%    88 Missing ⚠️
datalad_container/adapters/tests/test_oci_more.py    18.18%    54 Missing ⚠️
datalad_container/adapters/utils.py                  87.17%     5 Missing ⚠️
datalad_container/conftest.py                        66.66%     5 Missing ⚠️
datalad_container/containers_add.py                  60.00%     4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #277       +/-   ##
===========================================
- Coverage   94.60%   83.65%   -10.95%     
===========================================
  Files          24       28        +4     
  Lines        1112     1444      +332     
===========================================
+ Hits         1052     1208      +156     
- Misses         60      236      +176     

@yarikoptic commented

I guess it is too much of a crippled system; I will move those commits into a separate PR, no point in occluding things here.

yarikoptic and others added 6 commits October 15, 2025 09:43
Added comprehensive documentation for Claude Code to work effectively with
this codebase, including architecture overview, development commands, and
key implementation details.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Extended the OCI adapter to support any container registry without
hardcoding endpoints. The link() function now dynamically constructs
registry API endpoints using the pattern https://{registry}/v2/, with
Docker Hub as the only special case (registry-1.docker.io).

This enables automatic support for registries like:
- quay.io (Quay.io registry)
- gcr.io (Google Container Registry)
- ghcr.io (GitHub Container Registry)
- Any other V2-compatible registry

Changes:
- Removed hardcoded _ENDPOINTS dictionary
- Added dynamic endpoint construction in link() function
- Added unit tests for parsing references from alternative registries
- Added integration tests using real images:
  - ghcr.io/astral-sh/uv:latest for ghcr.io testing
  - quay.io/linuxserver.io/baseimage-alpine:3.18 for quay.io testing

The link() function will add registry URLs to annexed layer images for
any registry when proper provider configuration is available, enabling
efficient retrieval through git-annex.

All new tests are marked with @pytest.mark.ai_generated as per project
standards.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Enhanced the parametrized registry test to include:
1. Docker Hub (docker.io) with busybox:1.30 for consistency
2. Verification that annexed blobs exist in the OCI image
3. Check that all annexed files have URLs registered in either the
   datalad or web remote for efficient retrieval

The test now verifies that `git annex find --not --in datalad --and
--not --in web` returns empty, ensuring all blobs are accessible
through git-annex remotes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
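A minimal sketch of that emptiness check (assuming the test runs against the dataset's working directory; the helper name is hypothetical):

import subprocess

def assert_all_annexed_files_have_urls(ds_path: str) -> None:
    # Files in neither the 'datalad' nor the 'web' remote could not be
    # re-obtained after a drop, so the test expects an empty listing
    out = subprocess.run(
        ["git", "annex", "find", "--not", "--in", "datalad",
         "--and", "--not", "--in", "web"],
        cwd=ds_path, capture_output=True, text=True, check=True,
    )
    assert not out.stdout.strip(), f"files without registered URLs:\n{out.stdout}"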
Enhanced the parametrized registry test to verify the complete
drop/get cycle for the entire dataset:

1. Drops all annexed content in the dataset
2. Verifies that files were actually dropped (non-empty results)
3. Gets everything back from remotes
4. Verifies that files were retrieved (non-empty results)

This ensures that the registered URLs in datalad/web remotes are
functional and files can be successfully retrieved from the registry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This fixture ensures that sys.executable's directory is first in PATH
for the duration of tests. This is needed when tests spawn subprocesses
that need to import modules from the same Python environment that's
running pytest, preventing "No module named X" errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…er handling

- Add parametrized integration test covering docker.io, gcr.io, and quay.io
- Test container addition, execution, and annexed blob verification
- Add drop/get cycle testing to verify remote retrieval works
- Fix link() to create datalad remote even without provider configuration
  - Issue warning instead of skipping when provider not found
  - Allows URLs to be registered and files to be retrieved from any registry
- Use pytest tmp_path fixture instead of @with_tempfile decorator

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@yarikoptic yarikoptic requested review from asmacdo and mih October 20, 2025 13:37
@yarikoptic yarikoptic marked this pull request as ready for review October 20, 2025 13:37
@yarikoptic commented

@asmacdo, try to use it with containers that might be of interest to you.
