Conversation
c98b87f to
46eb5fd
Compare
rebkwok
left a comment
There was a problem hiding this comment.
Mostly I think I'm in favour of this. It certainly makes the functional test set up, and will make backend server setup simpler. Added some comments about additional documentation.
This means that we can run a real job on a real backend, right? So in TPP for example, we'd run the test-controller with backend tpp and the real agent's token. Then we'd add a job on the controller. How does the real agent pick it up? It's running pointing at the real controller endpoint. In the functional tests, we run a test agent too, which is pointing at the right controller endpoint. Or is there a step missing in the docs and to run on a real backend you'd also run a test agent?
If it is running with the real agent, how do we make sure that it isn't sending job data that job-server is going to pick up and complain about? (Maybe I've misunderstood, and this is intended to be separate from the production agent)
I also can't get the functional test to run locally, with this branch I get a connection error:
...
test-agent-1 | During handling of the above exception, another exception occurred:
test-agent-1 |
test-agent-1 | Traceback (most recent call last):
test-agent-1 | File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
test-agent-1 | return _run_code(code, main_globals, None,
test-agent-1 | File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
test-agent-1 | exec(code, run_globals)
test-agent-1 | File "/app/agent/service.py", line 42, in <module>
test-agent-1 | main()
test-agent-1 | File "/app/agent/service.py", line 36, in main
test-agent-1 | agent_main()
test-agent-1 | File "/app/agent/main.py", line 29, in main
test-agent-1 | active_tasks = handle_tasks(api)
test-agent-1 | File "/app/agent/main.py", line 38, in handle_tasks
test-agent-1 | active_tasks = task_api.get_active_tasks()
test-agent-1 | File "/app/agent/task_api.py", line 40, in get_active_tasks
test-agent-1 | agent_tasks = get_json("tasks/")["tasks"]
test-agent-1 | File "/app/agent/task_api.py", line 17, in get_json
test-agent-1 | return request_json("GET", path)
test-agent-1 | File "/app/agent/task_api.py", line 29, in request_json
test-agent-1 | response = session.request(method, url, data=data, headers=headers)
test-agent-1 | File "/opt/venv/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
test-agent-1 | resp = self.send(prep, **send_kwargs)
test-agent-1 | File "/opt/venv/lib/python3.10/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 388, in instrumented_send
test-agent-1 | raise exception.with_traceback(exception.__traceback__)
test-agent-1 | File "/opt/venv/lib/python3.10/site-packages/opentelemetry/instrumentation/requests/__init__.py", line 323, in instrumented_send
test-agent-1 | result = wrapped_send(
test-agent-1 | File "/opt/venv/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
test-agent-1 | r = adapter.send(request, **kwargs)
test-agent-1 | File "/opt/venv/lib/python3.10/site-packages/requests/adapters.py", line 677, in send
test-agent-1 | raise ConnectionError(e, request=request)
test-agent-1 | requests.exceptions.ConnectionError: HTTPConnectionPool(host='test-controller', port=8000): Max retries exceeded with url: /test/tasks/ (Caused by NewConnectionError("HTTPConnection(host='test-controller', port=8000): Failed to establish a new connection: [Errno 111] Connection refused"))
test-controller-1 | 2026-03-09 09:36:16.386Z created new db at /tmp/test-controller-workdir/db.sqlite
test-controller-1 | Dummy controller started with:
test-controller-1 | BACKEND=test
test-controller-1 | CONTROLLER_TASK_API_TOKEN=pass
test-controller-1 | CONTROLLER_TASK_API_ENDPOINT=http://localhost:8000/
test-controller-1 | [2026-03-09 09:36:16 +0000] [7] [INFO] Starting gunicorn 25.1.0
test-controller-1 | [2026-03-09 09:36:16 +0000] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
test-controller-1 | [2026-03-09 09:36:16 +0000] [7] [INFO] Using worker: sync
test-controller-1 | [2026-03-09 09:36:16 +0000] [32] [INFO] Booting worker with pid: 32
test-controller-1 | [2026-03-09 09:36:16 +0000] [7] [ERROR] Control server error: [Errno 13] Permission denied
test-controller-1 | [2026-03-09 09:36:16 +0000] [32] [INFO] Worker spawned (pid: 32)
test-controller-1 | [2026-03-09 09:36:16 +0000] [34] [INFO] Booting worker with pid: 34
test-controller-1 | [2026-03-09 09:36:16 +0000] [34] [INFO] Worker spawned (pid: 34)
test-controller-1 | [2026-03-09 09:36:16 +0000] [37] [INFO] Booting worker with pid: 37
test-controller-1 | [2026-03-09 09:36:16 +0000] [37] [INFO] Worker spawned (pid: 37)
test-controller-1 | [2026-03-09 09:36:16 +0000] [39] [INFO] Booting worker with pid: 39
test-controller-1 | [2026-03-09 09:36:16 +0000] [39] [INFO] Worker spawned (pid: 39)
test-controller-1 | [2026-03-09 09:36:16 +0000] [41] [INFO] Booting worker with pid: 41
test-controller-1 | [2026-03-09 09:36:16 +0000] [41] [INFO] Worker spawned (pid: 41)
test-controller-1 | 2026-03-09 09:36:16.763Z ctrl controller.service started
test-controller-1 | 2026-03-09 09:36:16.763Z ctrl controller.service ticks disabled
test-controller-1 | 2026-03-09 09:36:16.763Z ctrl jobrunner.run loop started
test-controller-1 | 2026-03-09 09:36:21.051Z ctrl Job executing on the backend status=StatusCode.INITIATED workspace=test action=generate_dataset id=oj4cwy6xliqbgdpp
error: Recipe `functional-test` failed with exit code 1
|
|
||
| The production image provides a `test-controller` command for running a dummy | ||
| controller (web API + controller loop) with ephemeral local state. It is used | ||
| in this repo's functional tests, but also can be used by other code than need |
There was a problem hiding this comment.
| in this repo's functional tests, but also can be used by other code than need | |
| in this repo's functional tests, but also can be used by other code that needs |
| The command requires exactly two arguments: | ||
|
|
||
| 1. backend name | ||
| 2. token value |
There was a problem hiding this comment.
We have various tokens in env variables, might be worth specifying which one we need here
| @@ -0,0 +1,56 @@ | |||
| #!/usr/bin/env bash | |||
There was a problem hiding this comment.
It'd be good to add some more documentation into this script itself (but I know this is still draft so maybe that was a plan for later, if we decide to go ahead with it). Thinking of similar scripts that allow running jobs in real backends, like the ehrql debug one
There was a problem hiding this comment.
Totally worth adding some more documentation, will do.
The primary purpose of this script is not to run it in real production backend. It's to run it in test environments, like the backend-server tests, and testing any AMI images we build actually work. When we moved to controller/agent split, the lack of ability to run a test job easily meant that the backend-server tests were significantly weakened. This has caused me issues with NIoT, as config mistakes that the tests would have caught if they'd run a job had to be hit in the real NIoT environment. And I consider it essential for reliably publishing AMI images.
The only scenario I anticipate it being useful in a real backend is when we're:
a) manually bootstrapping something (i.e. not AMI image based deployment)
b) we don't have an external controller to talk to (staging or real).
NIoT's deployment happens to hit that situation, but I don't expect EMIS or any other backend to.
There was a problem hiding this comment.
OK, this makes sense. Maybe just some words about that in the docs then? The fact that the script is included in the image implies that it might be used in production.
There was a problem hiding this comment.
Oh, also, when I run the functional test locally, the test-controller container doesn't get removed. Should it?
There was a problem hiding this comment.
Yeah, will add some words. Something along the lines that test-controller script is shipped in the prod image so can be used wholesale, but is not intended to be used in production.
If we really wanted to, I guess we could publish an alternate image with the script, but I don't think its worth it?
I'll look into the left over container, I hadn't spotted that, thanks.
The test script can be used as docker CMD, and takes a backend name and token, and then sets up an ephemeral controller that you can point a local agent to. It runs both the webapp and the service, using the same run-one-constantly technique we use in airlock. It overrides a bunch of configuration to make it a simpler and ephemeral controller. You can then use cli.add_job as normal to add a job to the controller.
This simplifies our docker-compose set up, and runs a little faster too. The motivation for test-controller script was be easier to run a test controller setup in other projects (e.g. backend-server, or manually testing backends), so it's important that job-runner also uses it, so that we can detect breakages and fix them at source.
Allows standalone use
This makes it easy to work with ephemeral test backends. You can still
override with <BACKEND>_MAX_{DB_}_WORKERS as normal.
I'm not 100% sure we should have per-backend default workers, tbh,
I think it might be better to require explicit config. The failure stat
would be to still be able to run jobs, but not as many. However, this PR
is about test controller, so I'm punting on that.
46eb5fd to
6896112
Compare
evansd
left a comment
There was a problem hiding this comment.
This seems reasonable to me too. I was worried this might end up being more invasive but the actual prod changes are pretty tiny.
Add a way to run a simple test controller
Runs both webapi and service, with a simple cli interface, in order to make it
easy todo integration testing in backend-server and other places.
The main points of interest: