
Speed up docker image building and switch base image to alpine #17731

Open · wants to merge 11 commits into master

Conversation

@FrankChen021 (Member) commented Feb 15, 2025

There are several problems with the current Dockerfile.

1. Extremely slow builds on Apple Silicon chips

Previously, to allow building the Docker image on Apple Silicon chips like the M1, the Dockerfile forced the build onto the amd64 platform. This was a workaround for node-sass not supporting ARM, see #13012:

FROM --platform=linux/amd64 maven:3.9 as builder

However, this drastically slows down the Docker build on these machines; it takes more than 15 minutes to build an image on my M1 laptop.

The main reason is that the build has to run under an x86 emulator on Apple Silicon.

2. Unfriendly to debug

Currently the distroless base image is used. It is a secure image, but it is unfriendly to debug: there is no curl, no wget, no lsof, and no net-tools. This makes it painful to investigate live issues.

3. web-console is rebuilt even when it has not changed

Most development does not involve the web-console module, yet it is built as part of the mvn package command that builds the other backend services.
Since the web-console module takes time to build, it slows down the whole build.

There are some other problems as well, described in the following section.

Description of changes

  1. The entire build is split into two stages: a web-console build stage which runs under the amd64 platform, and a distribution build stage which runs on the local development platform. During the distribution build stage, the pre-built web-console artifact is copied into the final distribution package (a sketch of this layout follows the build log below).

    This improves the build process drastically. On my laptop it now takes 120 seconds to complete the web-console build stage and 210 seconds to complete the backend service build stage, which is acceptable.

 => [web-console-builder 4/4] RUN --mount=type=cache,target=/root/.m2 if [ "true" = "true" ]; then     cd /src/web-console && mvn -B -ff -DskipUTs clean package; fi       126.4s
 => [builder 4/7] WORKDIR /src                                                                                                                                               0.0s
 => [builder 5/7] COPY --from=web-console-builder /src/web-console/target/web-console*.jar /src/web-console/target/                                                          0.0s
 => [builder 6/7] RUN --mount=type=cache,target=/root/.m2 if [ "true" = "true" ]; then       mvn -B -ff       clean install       -Pdist,bundle-contrib-exts       -Pskip  211.5s
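
A minimal sketch of this stage layout, assuming illustrative stage names, paths, and Maven flags rather than the exact Dockerfile in this PR:

# web-console stage pinned to amd64 because node-sass has no ARM build
FROM --platform=linux/amd64 maven:3.9 AS web-console-builder
WORKDIR /src
COPY web-console/ /src/web-console/
RUN --mount=type=cache,target=/root/.m2 \
    cd /src/web-console && mvn -B -ff -DskipUTs clean package

# backend stage runs on the native platform and reuses the web-console jar
FROM maven:3.9 AS builder
WORKDIR /src
COPY . /src/
COPY --from=web-console-builder /src/web-console/target/web-console*.jar /src/web-console/target/
RUN --mount=type=cache,target=/root/.m2 \
    mvn -B -ff clean install -Pdist,bundle-contrib-exts -DskipTests -Dweb.console.skip=true
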
  2. Do NOT use mvn to build web-console
    This greatly improves build performance when the contents of the web-console directory have not changed, by leveraging the Docker layer cache.

To make this work, we build the web-console in a node image directly. In development, when the web-console module has not changed, this skips its build entirely (see the sketch below).
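
For illustration only, a cache-friendly web-console stage built directly in a node image could look roughly like this; the node image tag and the npm script name are assumptions, not taken from this PR:

# dependency manifests are copied first so the install layer stays cached
# until the dependencies themselves change
FROM --platform=linux/amd64 node:20 AS web-console-builder
WORKDIR /src/web-console
COPY web-console/package.json web-console/package-lock.json ./
RUN npm ci
# source changes only invalidate the layers below this point
COPY web-console/ ./
RUN npm run build            # assumed build script name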

  3. Unified the JDK between the build stage and the final runtime environment

Previously, maven:3.9, which comes with JDK 17, was used for the build stage. This does NOT respect the JDK_VERSION argument in the Dockerfile, which means that if we build Druid for JDK 21 by specifying JDK_VERSION, the distribution is still built under JDK 17 but packaged to run in a JRE 21 environment.

In this PR this is fixed: the build stage and the final image use the SAME JDK version (sketched below).
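
As an illustration of the idea (not the exact Dockerfile in this PR), a single JDK_VERSION build argument can drive both stages; the Maven image tag and the Alpine JRE package name below are assumptions:

# JDK_VERSION is declared once and consumed by both stages
ARG JDK_VERSION=17

# build stage runs Maven on the requested JDK
FROM maven:3.9-eclipse-temurin-${JDK_VERSION} AS builder
WORKDIR /src
COPY . /src/
RUN --mount=type=cache,target=/root/.m2 mvn -B -DskipTests clean install -Pdist

# runtime stage uses the same JDK major version
FROM alpine:3
ARG JDK_VERSION
RUN apk add --no-cache bash openjdk${JDK_VERSION}-jre-headless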

  4. Switching the base image from gcr.io/distroless/java$JDK_VERSION-debian12 to alpine

This also drastically simplifies the Dockerfile. Previously we had to install busybox and download bash from somewhere inside the Dockerfile, which made it very complicated.

Since alpine comes with a shell, these steps are eliminated. The change does NOT bloat the image: on my machine the alpine-based image is 746MB, a little smaller than the distroless-based image.

druid                         latest                     6eb4ec6dc77f   34 minutes ago   746MB
druid                         distroless                 1daa75c32b0c   7 hours ago      761MB

Commonly used tools like curl, lsof and net-tools are also packaged in the final Docker image (a minimal example is shown below).
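
For example, the debugging tools can be pulled in with a single apk invocation; the Alpine package names below are my assumption of the right ones:

FROM alpine:3
# the shell ships with the base image; only the debug tools need adding
RUN apk add --no-cache bash curl lsof net-tools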

  5. Remove the evaluation of VERSION

Previously we used the following command to evaluate the project version, but this step takes a VERY LONG time on my laptop:

RUN --mount=type=cache,target=/root/.m2 VERSION=$(mvn -B -q org.apache.maven.plugins:maven-help-plugin:3.2.0:evaluate \
      -Dexpression=project.version -DforceStdout=true \
    ) \
...

We can see that after 254 seconds, the command is still running.

 => [builder 7/8] RUN VERSION=$(mvn -B -q org.apache.maven.plugins:maven-help-plugin:3.2.0:evaluate       -Dstyle.color=never -Dexpression=project.version -DforceStdout=  254.3s

This step is eliminated: because 'clean' is applied to the Maven command, there is only one tar file under the distribution directory, so a wildcard match is enough to find the file and decompress it (see the sketch below).
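
A rough sketch of that step; the tarball path and target directory are assumptions:

# a clean build leaves exactly one tarball, so a wildcard is sufficient
RUN tar -xzf /src/distribution/target/apache-druid-*-bin.tar.gz -C /opt \
    && mv /opt/apache-druid-* /opt/druid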

  6. Test-related modules are excluded from the distribution stage.

  7. druid.sh is also updated to ensure druid.host has a value before starting the Java process. This helps surface problems earlier (an illustrative guard is sketched below).
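
For illustration, such a guard in druid.sh could be as simple as the following; the variable name is an assumption, not the actual name used in the script:

# refuse to start the Java process if druid.host could not be determined
if [ -z "${DRUID_HOST}" ]; then
    echo "druid.host is empty; refusing to start" >&2
    exit 1
fi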

Release note

The default image is switched from gcr.io/distroless/java17-debian12 to alpine

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • been tested in a test Druid cluster.

@FrankChen021 FrankChen021 added the Docker https://hub.docker.com/r/apache/druid label Feb 15, 2025
@github-actions github-actions bot added the GHA label Feb 17, 2025
@kgyrtkirk (Member)

I was taking a look and was wondering about the following:

  • I feel like the BUILD_FROM_SOURCE option is very weird; why build these inside docker instead of packaging the release into a docker image?
    ** is it possible with an M1 to build the dist on the host and avoid building inside docker?
  • I don't really like that the new docker build customizes the distribution build logic inside the Dockerfile - with the hazard of using differently versioned tools
  • I think the web-console module doesn't correctly support incremental builds, so it gets rebuilt every time; I think fixing that would also make these things less painful - using the docker cache implicitly adds an incremental build option...

@FrankChen021 (Member, Author)

I was taking a look and was wondering about the following:

  • I feel like the BUILD_FROM_SOURCE option is very weird; why build these inside docker instead of packaging the release into a docker image?
    ** is it possible with an M1 to build the dist on the host and avoid building inside docker?

BUILD_FROM_SOURCE is a legacy feature that I didn't change; I kept it as is. However, this is how I built the Docker image on my M1 when building directly inside Docker took too long. The problem on the M1 is that building a Docker image is divided into two steps: first build the distribution tarball on the host machine, then use that to build the Docker image (see the sketch below). This should be fixed, because sometimes I can't even remember that I have to follow these two steps to get a Docker image.
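
For reference, that two-step flow looks roughly like this; the exact Maven profiles, Dockerfile path, and image tag are assumptions:

# step 1: build the distribution tarball on the host machine
mvn -B clean install -Pdist,bundle-contrib-exts -DskipTests

# step 2: wrap the pre-built tarball into a Docker image
docker build --build-arg BUILD_FROM_SOURCE=false \
    -t apache/druid:dev -f distribution/docker/Dockerfile .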

  • I don't really like that the new docker build customizes the distribution build logic inside the Dockerfile - with the hazard of using differently versioned tools

The core problem here is that web-console is different from the backend services: it's a front-end project with its own build toolchain.

  • I think the web-console module doesn't correctly support incremental builds, so it gets rebuilt every time; I think fixing that would also make these things less painful - using the docker cache implicitly adds an incremental build option...

This is why I made some changes to the web-console module so that we can use the docker cache

@kgyrtkirk (Member)

sometimes I can't even remember that I have to follow these two steps to get a Docker image.

yeah - things could be complicated; maybe it would be useful to place a script under the dev folder?

This is why I made some changes to the web-console module so that we can use the docker cache

I wonder if there is a way to convince maven not to rebuild that all the time; that comes up in a lot of other places as well, so fixing it more deeply could address those too.

Thank you for the insights - I think the best would be to re-pack the dist tarball which was produced outside docker (make BUILD_FROM_SOURCE=false the default); do you think that would work well with your M1 based system?

@FrankChen021 (Member, Author)

Building the distribution tarball outside docker is not a good idea, because we can't guarantee that the host machine has the correct version of the JDK installed. So this would only be an option for local development.

@kgyrtkirk (Member)

I don't think the installed jdk should matter - what could that alter?
We are building with the --release option - so the runtime target will be right.
Even the release guideline doesn't highlight that at all here.
I believe this approach would only be used during development, so I don't think it would matter that much.

So if you are able to build it, then that would be best - you would rely on maven to do the incremental build; if you want to do a full rebuild, you could git clean -dfx...

I would really like to know whether you are able to build the project in a reasonable time on your M1 system with BUILD_FROM_SOURCE=false set.

@FrankChen021 (Member, Author)

Why doesn't the JDK matter if BUILD_FROM_SOURCE is changed to false? The Dockerfile strictly defines the build and run environments; why should we always build the image outside Docker when the host machine might have its own JDK release?

For me, JDK 8, JDK 17 and JDK 21 are all installed. Before JDK 17 was well supported, I had to export JAVA_HOME to pick JDK 8; and now early JDK 21 releases are also buggy for building Druid, so someone may likewise need to switch JDKs before building. Turning this option to false by default introduces extra work for some existing users.

BUILD_FROM_SOURCE=false definitely saves build time (no need to prepare the Docker build context, no need to copy files into Docker, the local .m2 cache is used, and the already-built web-console can be skipped with -Dweb.console.skip). But this relies on a full understanding of how Druid is built. For building an official image, or in any CI/CD pipeline, turning this to false should not be an option.

This option should be retired in the future, since the web-console module can now benefit from the Docker cache.

@kgyrtkirk (Member)

I believe we see some things differently

Why doesn't the JDK matter if BUILD_FROM_SOURCE is changed to false?

I think it's not that important, as long as you are using a decent one which is supported:

  • jdk8 is not supported anymore - that's why we are able to use the release option
  • jdk11 reached EOL in 2024 and is not really supported anymore; it should probably be removed
  • jdk17 should work well
  • the jdk21 issue I know of is runtime specific, so I doubt the compiler would be affected - if it's something else, please let me know

I think this is mostly a workaround which adds complexity because:

  1. there is a hardware issue: node-sass does not support Apple M1
  2. web-console's incremental build is broken, so maven can't skip it correctly
  3. web-console depends on node-sass (but doesn't use it AFAIK)

I understand you can't fix 1 - but instead of fixing 2 or 3, you want to add complexity to the distribution build.
Having 2 is bad - fixing the incremental build of web-console would be beneficial in everyday developer work as well, as it keeps popping up in builds when someone forgets to exclude that module from the build.
I don't know what the option to fix 3 would be, but that would be better as well... so less outdated stuff would be needed.

I believe that if you want to build the distribution in Docker, that's a different thing/task/etc; it should not be an integral part of building the Docker images of the project.
If that were a way to build the dist, I guess it would probably be a more reliable way to build the release as well - doing so would ensure that the same classes are packaged into both the binary distribution and the docker images.

The current CI system uses github-actions, which has its own ideas/ways to set things up - so Docker is not used there right now.

@FrankChen021 (Member, Author)

  • jdk8 is not supported anymore - that's why we are able to use the release option
  • jdk11 reached EOL in 2024 and is not really supported anymore; it should probably be removed
  • jdk17 should work well
  • the jdk21 issue I know of is runtime specific, so I doubt the compiler would be affected - if it's something else, please let me know

Here I think you're talking about the suitable JDK for Druid. However, we are not only working on Druid, but also on other projects that use different JDKs.

@kgyrtkirk (Member)

I think 3rd party users usually download the released binary artifact in the *.tar.gz form.
They will do that also if they are building a docker image....

So in the scope of this docker image: why should it use an internally built artifact at all, if the usual practice - and the recommended approach when a release is done - is to have a jdk installed on the local developer machine?

Throwing maven-build-cache-extension into the web-console module kinda makes the incremental build problem of the web-console non-existent... it will still be built once to populate the cache,
but after that it is skipped pretty quickly - so if you are able to build (and wait for) the web-console module on your M1 mac at least once, then I think you'll be able to use the normal flow to produce the docker image in reasonable time.

https://github.com/kgyrtkirk/druid/tree/build-cache-try

@FrankChen021 (Member, Author)

I think 3rd party users usually download the released binary artifact in the *.tar.gz form. They will do that also if they are building a docker image....

So in the scope of this docker image: why should it use an internally built artifact at all, if the usual practice - and the recommended approach when a release is done - is to have a jdk installed on the local developer machine?

For those who deploy Druid in their private network/cloud, building a Docker image directly instead of building a tarball first is more straightforward.

Throwing maven-build-cache-extension into the web-console module kinda makes the incremental build problem of the web-console non-existent... it will still be built once to populate the cache, but after that it is skipped pretty quickly - so if you are able to build (and wait for) the web-console module on your M1 mac at least once, then I think you'll be able to use the normal flow to produce the docker image in reasonable time.

https://github.com/kgyrtkirk/druid/tree/build-cache-try

If it works, I think we can add it to the web-console module. (But I don't think using these Java toolchains for a front-end project is a good approach.)

@kgyrtkirk (Member)

If it works, I think we can add it to the web-console module.

yes - it works; you could check out the branch and try it - I've uploaded the branch to make it easier for you.

But I don't think using these Java toolchains for a front-end project is a good approach

it has some downsides (like the unconditional rebuild) - but it makes it part of the main build job, which I think is very useful.
Do you have some idea in mind which would make it better?
