Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker image is bigger than it needs to be #157

Open
daha opened this issue Jan 24, 2025 · 2 comments · May be fixed by #158
Open

Docker image is bigger than it needs to be #157

daha opened this issue Jan 24, 2025 · 2 comments · May be fixed by #158

Comments

@daha
Copy link

daha commented Jan 24, 2025

I saw that the datahub-actions docker image (latest) is a bit bigger than it needs to be.

As the RUN command on

RUN UV_LINK_MODE=copy uv pip install -e ".[all]"
do not include a --mount=type=cache... like on https://github.com/acryldata/datahub-actions/blob/main/docker/datahub-actions/Dockerfile#L44C5-L44C77 and is using UV_LINK_MODE=copy.

The size of acryldata/datahub-actions:v0.1.9 reported by docker images is 3.08GB.

I did a little experiment where I changed Line 49 like below, and then the size was 2.27GB.

RUN --mount=type=cache,target=/datahub-ingestion/.cache/uv,uid=1000,gid=1000 \
    UV_LINK_MODE=copy uv pip install -e ".[all]"

When I instead removed UV_LINK_MODE=copy like below from line 49, the docker image was 2.35GB, as it instead use hard links (default with UV) so all the package files are not duplicated.

RUN uv pip install -e ".[all]"

I think it would be best to remove the /datahub-ingestion/.cache/uv from the docker image by specifying --mount=type=cache... when it is built and if you really want to keep the uv .cache then remove UV_LINK_MODE instead to avoid many duplicated files.

I can create a PR for any of the alternatives if I only know which alternative you prefer.

@hsheth2
Copy link
Contributor

hsheth2 commented Jan 24, 2025

Yup let's change line 49. Adding the --mount=type=cache,target=/datahub-ingestion/.cache/uv,uid=1000,gid=1000 makes sense, and matches the behavior from line 44.

iirc we need to use UV_LINK_MODE=copy when using the cache mount, since hard links don't work across the cache volume into the base image layer.

A PR for this would be great :)

@daha daha linked a pull request Jan 24, 2025 that will close this issue
@daha
Copy link
Author

daha commented Jan 24, 2025

Yes, you are correct about UV_LINK_MODE=copy, but if we would keep the uv cache in the image it would have been a different thing and we would not need UV_LINK_MODE=copy as the cache is on the same drive.

I have created a PR #158 now, as you have seen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants