
Knowledge share: SecureDrop Continuous Integration

Erik Moeller edited this page Jun 29, 2021 · 4 revisions

Knowledge Share, 2021-06-28

Notes primarily taken by @creviera

the problem:

  1. ci has been a frequent pain point in terms of flakiness, reliability, and performance
  2. renew collaboration with infra

staging-test-with-rebase (59 mins, our longest running ci job)

current state: we are running the staging VM environment for the app and mon servers, performing a clean install (pulls in ansible code for admins), building the debian packages, and installing them in the VMs. this also fetches app-test artifacts. we are basically saying that every PR (actually, every commit!) should run it.

  • do we need to run this if the ansible code didn't change? - yes, because AppArmor rules need to be checked
  • how does it work?

    • when the circleci job starts, we run the vagrant-based VM setup on google cloud to make sure the VM setup used by developers still works. (there was another reason that conor mentioned that i missed)

    • ignores i18n- branches

    • develop/devops/gce-nested/gce-start.sh: script provisions the cloud vms. explicitly pins ci-nested-virt-buster-IMAGE_NUMBER and pre-fetches the VM images so we don't waste more wall time when we run vagrant up
  • should we run this on hardware?
  • when do we need to run this? [focus on this] - we don't need to run this on every commit
  • this is a pain during the release process

more on how this works:

  • we rebaseontarget to make sure it's run against the latest develop branch, then run staging tests on GCE (see ci-go script, which is a wrapper for everything: gce-start.sh, gce-runner.sh, gce-stop.sh), and then destroy all our securedrop-ci tagged VMs. the tag also lets a cron job (over in infra) poll for VMs with this label that have been running for longer than 6 hours and destroy them, so we're not charged a bunch of $ for them

    • gce-runner.sh does SSH bootstrapping, then (lines 60-61) runs make build-debs-notest and make staging (probably takes ~30 minutes to provision the system), then we verify the state of our provisioned VMs

    • after the GCE/GCP run, there's a brief step to extract test results in a machine-readable format.

    • after test results are stored, the environment will be torn down completely, regardless of pass or fail.
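
The cleanup rule described above (find VMs labeled securedrop-ci that have been running for longer than 6 hours) can be sketched in a few lines. This is only an illustration of the selection logic, not the actual infra cron job: the `find_stale_vms` function and the dict keys `name`, `labels`, and `created` are hypothetical, and a real job would get this data from the cloud provider's API before destroying anything.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the notes: the label on CI VMs and the 6-hour cutoff.
CI_LABEL = "securedrop-ci"
MAX_AGE = timedelta(hours=6)


def find_stale_vms(vms, now, label=CI_LABEL, max_age=MAX_AGE):
    """Return names of VMs carrying `label` that have run longer than `max_age`.

    `vms` is a list of dicts with hypothetical keys: "name", "labels"
    (a list of strings), and "created" (a timezone-aware datetime).
    """
    stale = []
    for vm in vms:
        if label not in vm["labels"]:
            continue  # not a CI VM, leave it alone
        if now - vm["created"] > max_age:
            stale.append(vm["name"])
    return stale


# Made-up inventory for illustration only.
now = datetime(2021, 6, 28, 12, 0, tzinfo=timezone.utc)
vms = [
    {"name": "ci-old", "labels": ["securedrop-ci"], "created": now - timedelta(hours=7)},
    {"name": "ci-new", "labels": ["securedrop-ci"], "created": now - timedelta(minutes=30)},
    {"name": "other", "labels": [], "created": now - timedelta(days=2)},
]
print(find_stale_vms(vms, now))  # ['ci-old']
```

Filtering on the label first means a long-running VM that is not part of CI is never considered for deletion, whatever its age.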

questions:

  • should we move away from circleci now that github has nested virtualization support?

    • we should look into shaving some time off from running google cloud platform (gcp)
  • should we use circleci orbs for filtering for when a job should be run?

    • could we look at diffs and determine what should be run?

    • file and branch filtering already helps us determine what we should run. how would orbs improve this? it might be more maintainable, but perhaps not a performance improvement

    • we're still mostly interested in performance improvements so research is needed to see if it actually shaves off time with environmental setup

  • could someone give some background info on circleci/ vs cimg/ images?

    • cimg/ images are newer circleci-maintained images, which we should be using
  • one area for infra to dig into?

    • we're losing a lot of time on container builds (in .circleci/config.yml)

    • we could also combine some of these test steps, e.g. is there a reason to run make build-debs-notest on circleci and gcp as well?

    • the debs do not have a commit hash appended - just the version.

    • we could start building debs in nightlies and then ci could pull from apt-test, like we do in securedrop workstation land. orrr just build them once in a job and share them.
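
To make the "look at diffs and determine what should be run" question more concrete, here is a minimal sketch of path-based job selection. Everything in it is an assumption for discussion: the `JOB_RULES` mapping, the path patterns, and the `jobs_for_changes` helper are hypothetical and do not reflect the current .circleci/config.yml. Note that the staging job is mapped to application paths as well as provisioning paths, since the notes say it must run even when the ansible code didn't change.

```python
from fnmatch import fnmatch

# Hypothetical mapping from CI job name to the path patterns that should
# trigger it. The patterns are illustrative, not the project's real rules.
JOB_RULES = {
    "staging-test-with-rebase": ["install_files/*", "molecule/*", "securedrop/*"],
    "app-tests": ["securedrop/*"],
    "admin-tests": ["admin/*"],
}


def jobs_for_changes(changed_files, rules=JOB_RULES):
    """Return the sorted list of jobs whose patterns match any changed file."""
    selected = set()
    for job, patterns in rules.items():
        if any(fnmatch(path, pat) for path in changed_files for pat in patterns):
            selected.add(job)
    return sorted(selected)


# A docs-only change selects nothing; an app change selects two jobs.
print(jobs_for_changes(["docs/index.rst"]))
print(jobs_for_changes(["securedrop/models.py"]))
```

In practice the list of changed files would come from something like a diff against the target branch. Whether this saves wall time depends on how much of each job is environmental setup, which is the open research question noted above.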

admin-tests () and app-tests (40 mins, 3 parallel runs)

current state:

  • app-tests is parallelized via --split-by=timings in the .circleci/config.yml (line 107) - there might be a more sophisticated way to parallelize
  • this is the only place in ci where we use the parallelism tag (is this correct?)
  • lint is taking an unexpectedly long time, probably because of environmental setup
  • parallelism: 20 for translation-tests - we need to bump this up each time we add a new language
  • if a branch is prefixed with i18n- then this ci job is run
  • there is a devops/scripts script somewhere that determines if the translation-tests is run - instead, you can use circleci, which could save ~5 minutes to figure out if this should be run
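
For background on what splitting by timings is trying to achieve, here is a small sketch of one common approach: greedily assigning the slowest test files first to whichever runner currently has the least work. This is an illustration only and not a description of how CircleCI implements --split-by=timings; the `split_by_timings` function and the timing numbers are made up.

```python
import heapq


def split_by_timings(timings, buckets):
    """Greedily assign test files to `buckets` runners, slowest file first.

    `timings` maps a test file name to its last known duration in seconds.
    Returns a list of `buckets` lists of file names.
    """
    # Min-heap of (total seconds assigned so far, runner index).
    heap = [(0.0, i) for i in range(buckets)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(buckets)]
    # Sort by duration descending, breaking ties by name for determinism.
    for name, seconds in sorted(timings.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)  # least-loaded runner
        assignment[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))
    return assignment


# Hypothetical durations in seconds, split across 3 parallel runs.
timings = {
    "test_a.py": 600,
    "test_b.py": 500,
    "test_c.py": 300,
    "test_d.py": 200,
    "test_e.py": 100,
}
for runner, files in enumerate(split_by_timings(timings, 3)):
    print(runner, files)
```

With these numbers the three runners end up with 600, 600, and 500 seconds of work, so the slowest runner sets the wall time. Adding runners only helps until the single slowest file becomes the bottleneck.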
