Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/build task overhaul #710

Merged
merged 6 commits into from
Sep 8, 2023
Merged

Conversation

henrjk
Copy link
Member

@henrjk henrjk commented Jun 30, 2023

This changes build tasks to no longer copy build outputs into the docker-context. This is implemented by having Dockerfile's use files from natural build output location(s) and by default have docker-context to be the same as the repository root. This also allows using the same Dockerfile for ods-pipeline as well as local builds.

In addition build task skipping (and the related caching) is enhanced to to allow multiple input and output locations. cache-build.sh and copy-build-if-cached.sh no longer have an output-dir parameter and instead support parameters:

  • --cache-sources: List of source directories (as colon separated string) which influence the build. These directories are relative to the repository root. If the contents in these directories change the cache is invalidated and the build task will rebuild from scratch.
    The default is that this value is empty which is understood as disabling build task skipping and the related caching.
    In the rather typical case where a build tasks' sources match the working-dir, this value should be the same. However if there are files outside the working-dir the value must capture these correctly as otherwise builds may be skipped although they need to run again.

  • --cached-outputs: List of build output directories (as colon separated string) to be cached. These directories are relative to the working-dir parameter.

The build tasks themselves are adjusted accordingly. All build tasks now support parameter --cache-sources with a default "" (empty String) as a task parameter, so that build task skipping is now an opt-in feature.

In addition:

  • npm: no longer copies build outputs to output-dir. The parameters build-dir, copy-node-modules and output-dir which configure that behavior are removed and the build script is simplified. Instead the task now supports the cached-outputs parameter which defaults to dist. When other locations must be cached, the parameter would have to be adjusted. For example to build:node_modules if an associated Dockerfile would copy from these locations.

  • python: no longer copies build outputs. The parameter output-dir was removed. In addition as for Python the actual build is meant to happen during image building no cached-outputs are needed and the task always passes an empty string here. Of course build tasks would still be skipped if one opted in using the cache-sources parameter.

  • go does not support sets cached-outputs on the task level and instead sets the scripts cached-outputs parameter to output-dir.

  • gradle supports the cached-outputs parameter with a default being 'docker'.

Closes #678

Tasks:

  • Updated design documents in docs/design directory or not applicable
  • Updated user-facing documentation in docs directory or not applicable
  • Ran tests (e.g. make test) or not applicable
  • [TODO] Updated changelog or not applicable

@henrjk
Copy link
Member Author

henrjk commented Jul 3, 2023

@michaelsauter, @oalyman as well as anyone else interested, this is ready for feedback.

Copy link
Member

@michaelsauter michaelsauter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay in reviewing this.

After some thought, I believe this is much better than what we had, I really like what you did here 👍

I made some minor inline comments, but generally this looks good to me already.

Regarding the Go / Gradle tasks, I would like to align these as well, but we can do this in a separate PR as well if you prefer.

build/package/scripts/build-npm.sh Outdated Show resolved Hide resolved
--cached-outputs) outputs_str="$2"; shift;;
--cached-outputs=*) outputs_str="${1#*=}";;

--cache-extra-inputs) extra_inputs_str="$2"; shift;;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the name a bit confusing. Maybe something like --cache-dependencies? Though I am not convinced this is really better. Or maybe just changing the order to --extra-cache-inputs would help? If you do not like any of these suggestions I'm fine with what you propose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the other comment I called now the combination of what previously would have been working-dir and then the cache-extra-inputs as cache-sources. Let me know if you find this confusing or have a worthwhile alternative proposal.

build/package/scripts/cache-build.sh Show resolved Hide resolved
build/package/scripts/cache-build.sh Outdated Show resolved Hide resolved
description: >-
List of build output directories (as colon separated string) to be cached.
These directories are relative to `working-dir`.
If caching is not enabled this value is not used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do not need the separate toggle cache-build anymore. Its description is a bit outdated now, and I think we could have the caching be based on the presence of the cached-outputs param. If it is the empty string, no caching should happen. If one or more paths are specified, those should be cached.

That may lead to the question what we should have as defaults for this param, and I am leaning towards using the empty string as a default, and making build caching opt in. I feel that without an explicit output-dir, caching is more subtle and it would be good if people read about the feature and understand what has to be done instead of unknowingly using some magic values which may or may not cache everything they need.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally I decided to remove extra inputs and instead have a cache-sources parameter:

  • List of build source directories (as colon separated string) which influence the build.
    These directories are relative to the repository root.
    If the contents in these directories change the cache is invalidated so that the build task will rebuild from scratch.
    If left empty no caching is enabled. When sources are only in the working-dir, this value can be the same value.

You will have to opt in to build task skipping using parameter cache-sources for all languages.

The cache-outputs parameter can be used to determine the folders inside working-dir to cache which is used as follows in the language tasks:

  • go: not supported, passed into caching scripts with value of output-dir
  • npm: supported with default of 'dist'
  • gradle: supported with default of 'docker'
  • python: not supported, passed into caching scripts as empty value. As a reminder a build may still be skipped if the input doesn't change, but there is no need to cache any built files.

To me this feels quite natural now. What do you think?

@@ -30,11 +30,11 @@ spec:
- name: dockerfile
description: Path to the Dockerfile to build (relative to `docker-dir`).
type: string
default: ./Dockerfile
default: ./docker/Dockerfile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would tend to keep ./Dockerfile. I know this is a breaking change but changing the docker-dir already is one.

test/tasks/ods-build-go_test.go Outdated Show resolved Hide resolved
COPY src /node/
COPY node_modules /node/
COPY package.json /node/
COPY package-lock.json /node/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this could be simplified to COPY src node_modules package.json package-lock.json /node/

@henrjk henrjk force-pushed the feature/build-cache-simplification branch from 5354b99 to dd90652 Compare September 1, 2023 15:17
@henrjk henrjk marked this pull request as ready for review September 1, 2023 15:43
@henrjk henrjk changed the title Feature/build cache simplification Feature/build task overhaul Sep 1, 2023
@henrjk henrjk self-assigned this Sep 1, 2023
Copy link
Member

@michaelsauter michaelsauter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it 👍

The only thing I am still struggling with is the naming of the parameters. I do find it a bit unintuitive that changing the "cache-sources" parameter influences whether caching is on or off.

So while reading this PR I was reminded of https://go.dev/blog/rebuild, which deals with a different, but related topic. The article uses the term "relevant input" to describe what influences a build. Reading the ADR in this PR, you actually use very similar language. My conclusion then is that:

  • "input" and "output" seem to be the natural way to think about this
  • "cached outputs" controls what gets cached so it feels more suitable to control whether caching is on or off

My suggestion would be the following:

  • rename "cache-sources" to "build-inputs" or "relevant-inputs" or "relevant-cache-inputs" (leaning towards the latter). Naturally, this should default to the working dir and I would expect it to be changed by users very rarely (e.g. one example we know where this is helpful is for monorepos where one build depends on the other)
  • keep "cached-outputs" including the defaults you chose. However, introduce a special string that disables caching, I would use "disabled" (or maybe something like "!disabled" - I think I like this most). In the case of Python, there currently is no "cached-outputs" param, but I would add it, with the default being the empty string. This would enable caching by default (caching only the artifacts), and allow to disable the caching by setting it to "!disabled". It would also allow users of the Python task with their own script to cache generated files for example.

@henrjk
Copy link
Member Author

henrjk commented Sep 4, 2023

  • "cached outputs" controls what gets cached so it feels more suitable to control whether caching is on or off

Python and other dynamic languages where there is nothing to build have no outputs to cache but use the build for automated tests and thus build skipping makes still sense. I suppose it makes more sense to think and explain this feature with the notion of build skipping.
I myself have sometimes the caching perspective in mind, and this may shine through in the language I used.

Nonetheless looking at it with the sense of build skipping makes much more sense to me and then enabling/disabling with inputs seems okish for me. An alternative could be to bring the build-skipping-enabled flag back and also add the extra-inputs but the way it is proposed now made sense to me as well.

Let me know how this affects your previous comment!

@henrjk
Copy link
Member Author

henrjk commented Sep 4, 2023

  • rename "cache-sources" to "build-inputs" or "relevant-inputs" or "relevant-cache-inputs" (leaning towards the latter). Naturally, this should default to the working dir and I would expect it to be changed by users very rarely (e.g. one example we know where this is helpful is for monorepos where one build depends on the other)

To me "sources" implied that these are build inputs. Nonetheless I also don't mind changing a less generic term and use inputs instead.

Although "relevant-inputs", "relevant-cache-inputs" or "relevant-build-inputs" make sense to me, here is another suggestion:

    - name: build-inputs
      description: |
        List of relevant directories (as colon separated string) with files that influence the build.
        These directories are relative to the repository root.
        If these directories do not change the build is skipped and any files that were 
        previously built are restored as well as the prior build reports. 
        The built files to restore are specified with the "build-outputs" parameter. 
        Otherwise (if build-inputs changed) the build starts from scratch.
        The value "!always-rebuild" means that the build is never skipped.
      type: string
      default: $(parameter.working-dir)

One doubt I have is: can a Task parameter's default be a value of another parameter? If not one would have to use some other magic value or reintroduce the "build-extra-inputs"

@michaelsauter
Copy link
Member

michaelsauter commented Sep 8, 2023

After some time I have to say I did not come to a conclusion that feels completely right to me.

I tend towards using "extra inputs" as a param, as that captures the intend for most cases, and avoids having to align this param with the working dir param.

I am not sure how I would implement the possibility to disable skipping. One thing that came to my mind is that if "extra inputs" would be set to a path that always changes, it would effectively disable skipping. However, I do not know of any file in Unix that one can compute the md5 of that would always change (/dev/(u)random is close but we'd need some special handling to read only part of it). If we would like to follow this line of thought, one could write the name of the pipeline run, which I believe to be unique enough, into the ODS context (so the content of ./ods/pipeline-run would equal the name of the pipeline run). Then setting the "extra inputs" param to .ods/pipeline-run the build would never be skipped.

In summary, feel free to pick what feels best to you now, and we can always fine-tune later.

@henrjk
Copy link
Member Author

henrjk commented Sep 8, 2023

I was pondering about the random suggestion but for now felt that we should instead reintroduce the flag and have extra inputs. Following is the updated draft of the description:

Draft to be merged

This changes build tasks to no longer copy build outputs into the docker-context. This is implemented by having Dockerfile's use files from natural build output location(s) and by default have docker-context to be the same as the repository root. This also allows using the same Dockerfile for ods-pipeline as well as local builds.

In addition build task skipping (and the related caching) is enhanced to to allow multiple input and output locations. cache-build.sh and copy-build-if-cached.sh no longer have an output-dir parameter and instead support parameters:

--build-extra-inputs: List of build source directories (as colon separated string) which in addition to working-dir influence the build.
These directories are relative to the repository root.
If the contents in these directories change the cache is invalidated so that the build task will rebuild from scratch.

--cached-outputs: List of build output directories (as colon separated string) to be cached. These directories are relative to the working-dir parameter.

The build tasks themselves are adjusted accordingly. All build tasks now support parameter --build-extra-inputs with a default "" (empty String) as a task parameter.

In addition:

npm: no longer copies build outputs to output-dir. The parameters build-dir, copy-node-modules and output-dir which configure that behavior are removed and the build script is simplified. Instead the task now supports the cached-outputs parameter which defaults to dist. When other locations must be cached, the parameter would have to be adjusted. For example to build:node_modules if an associated Dockerfile would copy from these locations.

python: no longer copies build outputs. The parameter output-dir was removed. In addition as for Python the actual build is meant to happen during image building no cached-outputs are needed and the task always passes an empty string here. Build tasks are still skipped if the working-dir is unchanged.

go does not support cached-outputs on the task level and instead sets the scripts cached-outputs parameter to output-dir.

gradle supports the cached-outputs parameter with a default being 'docker'.

@henrjk henrjk merged commit 6db04a4 into master Sep 8, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

build tasks streamlining and avoidance of file copies
2 participants