Skip to content

Standalone test-controller#1353

Draft
bloodearnest wants to merge 5 commits intomainfrom
test-controller-standalone
Draft

Standalone test-controller#1353
bloodearnest wants to merge 5 commits intomainfrom
test-controller-standalone

Conversation

@bloodearnest
Copy link
Copy Markdown
Member

@bloodearnest bloodearnest commented Mar 5, 2026

Add a way to run a simple test controller

Runs both webapi and service, with a simple cli interface, in order to make it
easy todo integration testing in backend-server and other places.

The main points of interest:

  • it runs both webapi and controller in the same container
  • it ships the test-controller script in the docker image so it can be easily used elsewhere
  • we use it in job-runner's own functional tests, so we are sure it's working

@bloodearnest bloodearnest force-pushed the test-controller-standalone branch 2 times, most recently from c98b87f to 46eb5fd Compare March 5, 2026 14:18
@bloodearnest bloodearnest marked this pull request as draft March 6, 2026 13:39
Copy link
Copy Markdown
Contributor

@rebkwok rebkwok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly I think I'm in favour of this. It certainly makes the functional test set up, and will make backend server setup simpler. Added some comments about additional documentation.

This means that we can run a real job on a real backend, right? So in TPP for example, we'd run the test-controller with backend tpp and the real agent's token. Then we'd add a job on the controller. How does the real agent pick it up? It's running pointing at the real controller endpoint. In the functional tests, we run a test agent too, which is pointing at the right controller endpoint. Or is there a step missing in the docs and to run on a real backend you'd also run a test agent?

If it is running with the real agent, how do we make sure that it isn't sending job data that job-server is going to pick up and complain about? (Maybe I've misunderstood, and this is intended to be separate from the production agent)

I also can't get the functional test to run locally, with this branch I get a connection error:

...
test-agent-1  | During handling of the above exception, another exception occurred:
test-agent-1  | 
test-agent-1  | Traceback (most recent call last):
test-agent-1  |   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
test-agent-1  |     return _run_code(code, main_globals, None,
test-agent-1  |   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
test-agent-1  |     exec(code, run_globals)
test-agent-1  |   File "/app/agent/service.py", line 42, in <module>
test-agent-1  |     main()
test-agent-1  |   File "/app/agent/service.py", line 36, in main
test-agent-1  |     agent_main()
test-agent-1  |   File "/app/agent/main.py", line 29, in main
test-agent-1  |     active_tasks = handle_tasks(api)
test-agent-1  |   File "/app/agent/main.py", line 38, in handle_tasks
test-agent-1  |     active_tasks = task_api.get_active_tasks()
test-agent-1  |   File "/app/agent/task_api.py", line 40, in get_active_tasks
test-agent-1  |     agent_tasks = get_json("tasks/")["tasks"]
test-agent-1  |   File "/app/agent/task_api.py", line 17, in get_json
test-agent-1  |     return request_json("GET", path)
test-agent-1  |   File "/app/agent/task_api.py", line 29, in request_json
test-agent-1  |     response = session.request(method, url, data=data, headers=headers)
test-agent-1  |   File "/opt/venv/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
test-agent-1  |     resp = self.send(prep, **send_kwargs)
test-agent-1  |   File "/opt/venv/lib/python3.10/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 388, in instrumented_send
test-agent-1  |     raise exception.with_traceback(exception.__traceback__)
test-agent-1  |   File "/opt/venv/lib/python3.10/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 323, in instrumented_send
test-agent-1  |     result = wrapped_send(
test-agent-1  |   File "/opt/venv/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
test-agent-1  |     r = adapter.send(request, **kwargs)
test-agent-1  |   File "/opt/venv/lib/python3.10/site-packages/requests/adapters.py", line 677, in send
test-agent-1  |     raise ConnectionError(e, request=request)
test-agent-1  | requests.exceptions.ConnectionError: HTTPConnectionPool(host='test-controller', port=8000): Max retries exceeded with url: /test/tasks/ (Caused by NewConnectionError("HTTPConnection(host='test-controller', port=8000): Failed to establish a new connection: [Errno 111] Connection refused"))
test-controller-1  | 2026-03-09 09:36:16.386Z created new db at /tmp/test-controller-workdir/db.sqlite 
test-controller-1  | Dummy controller started with:
test-controller-1  | BACKEND=test
test-controller-1  | CONTROLLER_TASK_API_TOKEN=pass
test-controller-1  | CONTROLLER_TASK_API_ENDPOINT=http://localhost:8000/
test-controller-1  | [2026-03-09 09:36:16 +0000] [7] [INFO] Starting gunicorn 25.1.0
test-controller-1  | [2026-03-09 09:36:16 +0000] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
test-controller-1  | [2026-03-09 09:36:16 +0000] [7] [INFO] Using worker: sync
test-controller-1  | [2026-03-09 09:36:16 +0000] [32] [INFO] Booting worker with pid: 32
test-controller-1  | [2026-03-09 09:36:16 +0000] [7] [ERROR] Control server error: [Errno 13] Permission denied
test-controller-1  | [2026-03-09 09:36:16 +0000] [32] [INFO] Worker spawned (pid: 32)
test-controller-1  | [2026-03-09 09:36:16 +0000] [34] [INFO] Booting worker with pid: 34
test-controller-1  | [2026-03-09 09:36:16 +0000] [34] [INFO] Worker spawned (pid: 34)
test-controller-1  | [2026-03-09 09:36:16 +0000] [37] [INFO] Booting worker with pid: 37
test-controller-1  | [2026-03-09 09:36:16 +0000] [37] [INFO] Worker spawned (pid: 37)
test-controller-1  | [2026-03-09 09:36:16 +0000] [39] [INFO] Booting worker with pid: 39
test-controller-1  | [2026-03-09 09:36:16 +0000] [39] [INFO] Worker spawned (pid: 39)
test-controller-1  | [2026-03-09 09:36:16 +0000] [41] [INFO] Booting worker with pid: 41
test-controller-1  | [2026-03-09 09:36:16 +0000] [41] [INFO] Worker spawned (pid: 41)
test-controller-1  | 2026-03-09 09:36:16.763Z ctrl controller.service started 
test-controller-1  | 2026-03-09 09:36:16.763Z ctrl controller.service ticks disabled 
test-controller-1  | 2026-03-09 09:36:16.763Z ctrl jobrunner.run loop started 
test-controller-1  | 2026-03-09 09:36:21.051Z ctrl Job executing on the backend status=StatusCode.INITIATED workspace=test action=generate_dataset id=oj4cwy6xliqbgdpp
error: Recipe `functional-test` failed with exit code 1


The production image provides a `test-controller` command for running a dummy
controller (web API + controller loop) with ephemeral local state. It is used
in this repo's functional tests, but also can be used by other code than need
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
in this repo's functional tests, but also can be used by other code than need
in this repo's functional tests, but also can be used by other code that needs

The command requires exactly two arguments:

1. backend name
2. token value
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have various tokens in env variables, might be worth specifying which one we need here

@@ -0,0 +1,56 @@
#!/usr/bin/env bash
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to add some more documentation into this script itself (but I know this is still draft so maybe that was a plan for later, if we decide to go ahead with it). Thinking of similar scripts that allow running jobs in real backends, like the ehrql debug one

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally worth adding some more documentation, will do.

The primary purpose of this script is not to run it in real production backend. It's to run it in test environments, like the backend-server tests, and testing any AMI images we build actually work. When we moved to controller/agent split, the lack of ability to run a test job easily meant that the backend-server tests were significantly weakened. This has caused me issues with NIoT, as config mistakes that the tests would have caught if they'd run a job had to be hit in the real NIoT environment. And I consider it essential for reliably publishing AMI images.

The only scenario I anticipate it being useful in a real backend is when we're:
a) manually bootstrapping something (i.e. not AMI image based deployment)
b) we don't have an external controller to talk to (staging or real).

NIoT's deployment happens to hit that situation, but I don't expect EMIS or any other backend to.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this makes sense. Maybe just some words about that in the docs then? The fact that the script is included in the image implies that it might be used in production.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, also, when I run the functional test locally, the test-controller container doesn't get removed. Should it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, will add some words. Something along the lines that test-controller script is shipped in the prod image so can be used wholesale, but is not intended to be used in production.

If we really wanted to, I guess we could publish an alternate image with the script, but I don't think its worth it?

I'll look into the left over container, I hadn't spotted that, thanks.

The test script can be used as docker CMD, and takes a backend name and
token, and then sets up an ephemeral controller that you can point
a local agent to. It runs both the webapp and the service, using the
same run-one-constantly technique we use in airlock.

It overrides a bunch of configuration to make it a simpler and ephemeral
controller.

You can then use cli.add_job as normal to add a job to the controller.
This simplifies our docker-compose set up, and runs a little faster too.

The motivation for test-controller script was be easier to run a test
controller setup in other projects (e.g. backend-server, or manually
testing backends), so it's important that job-runner also uses it, so
that we can detect breakages and fix them at source.
This makes it easy to work with ephemeral test backends. You can still
override with <BACKEND>_MAX_{DB_}_WORKERS as normal.

I'm not 100% sure we should have per-backend default workers, tbh,
I think it might be better to require explicit config. The failure stat
would be to still be able to run jobs, but not as many. However, this PR
is about test controller, so I'm punting on that.
@bloodearnest bloodearnest force-pushed the test-controller-standalone branch from 46eb5fd to 6896112 Compare March 9, 2026 10:26
Copy link
Copy Markdown
Contributor

@evansd evansd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me too. I was worried this might end up being more invasive but the actual prod changes are pretty tiny.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants