[Feature]: Refactor trust hub communication to outbound-only #220
Labels
enhancement: New feature or request
Description
Problem
The current hub-trust communication model requires the central hub to make inbound HTTP requests to trust APIs (for cohort queries, imaging project creation, health checks, etc.). This creates several problems:
- Firewall constraints: NHS trusts sit behind restrictive firewalls. Exposing trust API ports for inbound connections requires firewall rules, TLS certificate management, and increases the attack surface.
- Certificate management burden: Each trust needs TLS certificates generated, distributed, and renewed (`trust/certs/generate-trust-certs.sh`). The hub must trust each certificate.
- Fragile connectivity: If the hub can't reach a trust (network blip, trust restart), the request fails immediately with no retry mechanism. Tasks are lost.
- Scaling difficulty: Adding a new trust requires configuring its endpoint URL, port, and certificates on the hub side, plus opening firewall rules.
Solution
Replace the inbound request model with a trust-initiated outbound polling architecture:
- Trusts poll the hub for pending tasks over HTTPS (`GET /tasks/{trust_name}/pending`)
- The hub queues tasks in a `trust_task` database table (PostgreSQL as FIFO buffer)
- Trusts report results back to the hub (`POST /tasks/{trust_name}/{task_id}/result`)
- Trusts send heartbeats to replace hub-initiated health checks
- A scheduled job recovers tasks stuck in `IN_PROGRESS` (stale task recovery)
All communication is outbound from the trust: no inbound ports, no trust-side TLS certificates, and no inbound firewall rules are needed at the trust site.
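The polling flow above can be sketched as a single cycle on the trust side. This is a minimal illustration, not the actual `trust-api` code: the task fields (`id`, `task_type`, `payload`), the result shape (`success`, `data`, `error`), and the injected `fetch_pending` / `submit_result` callables (stand-ins for the outbound HTTPS calls) are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Task = Dict[str, Any]
Handler = Callable[[Dict[str, Any]], Any]


def poll_once(
    trust_name: str,
    fetch_pending: Callable[[str], List[Task]],
    submit_result: Callable[[str, Dict[str, Any]], None],
    handlers: Dict[str, Handler],
) -> int:
    """Run one polling cycle and return the number of tasks processed."""
    # Outbound request 1: ask the hub for this trust's pending tasks.
    tasks = fetch_pending(f"/tasks/{trust_name}/pending")
    for task in tasks:
        handler = handlers.get(task["task_type"])
        if handler is None:
            result = {"success": False, "error": f"unknown task type: {task['task_type']}"}
        else:
            try:
                result = {"success": True, "data": handler(task["payload"])}
            except Exception as exc:  # report handler failures instead of crashing the loop
                result = {"success": False, "error": str(exc)}
        # Outbound request 2: report the outcome back to the hub.
        submit_result(f"/tasks/{trust_name}/{task['id']}/result", result)
    return len(tasks)


def run_poller(
    trust_name: str,
    fetch_pending: Callable[[str], List[Task]],
    submit_result: Callable[[str, Dict[str, Any]], None],
    handlers: Dict[str, Handler],
    interval_seconds: float = 5.0,
    max_cycles: Optional[int] = None,  # hypothetical knob so the loop can be bounded
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll repeatedly, sleeping between cycles."""
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        poll_once(trust_name, fetch_pending, submit_result, handlers)
        sleep(interval_seconds)
        cycle += 1
```

Because both callables only ever initiate requests from the trust, nothing in this loop requires the hub to open a connection to the trust.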
Changes
Core architecture (flip-api)
- Task queue model: New `TrustTask` table with fields for task type, payload, status, result, retry count, and post-processing flag
- Task dispatch endpoint: `GET /tasks/{trust_name}/pending`, which returns pending tasks and marks them `IN_PROGRESS`
- Result submission endpoint: `POST /tasks/{trust_name}/{task_id}/result`, with trust ownership verification to prevent cross-trust spoofing
- Heartbeat endpoint: `POST /trust/{trust_name}/heartbeat`, which replaces hub-initiated health checks
- Stale task recovery: Scheduled job resets stuck `IN_PROGRESS` tasks back to `PENDING`, with a retry limit (`TASK_MAX_RETRIES=3`) to prevent poison task loops
- Imaging post-processing: Credential emails and status persistence run after `CREATE_IMAGING` task completion, with automatic retry on failure
- Task types: `COHORT_QUERY`, `CREATE_IMAGING`, `DELETE_IMAGING`, `GET_IMAGING_STATUS`, `REIMPORT_STUDIES`, `UPDATE_USER_PROFILE`
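The dispatch and stale-recovery behaviour described above can be illustrated with an in-memory sketch. It assumes a simplified `TrustTask` (a plain dataclass rather than the real SQLModel table), a hypothetical `started_at` timestamp, a `COMPLETED` status that this issue does not name, and one possible reading of the retry limit: a task is marked `FAILED` once it goes stale after already being retried `TASK_MAX_RETRIES` times.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional


class TaskStatus(str, Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"  # assumed name; not stated in the issue
    FAILED = "FAILED"


@dataclass
class TrustTask:
    id: int
    trust_name: str
    task_type: str
    status: TaskStatus = TaskStatus.PENDING
    retry_count: int = 0
    started_at: Optional[datetime] = None  # hypothetical field used to detect staleness


def claim_pending(tasks: List[TrustTask], trust_name: str, now: datetime) -> List[TrustTask]:
    """Return this trust's pending tasks in FIFO order and mark them IN_PROGRESS."""
    claimed = sorted(
        (t for t in tasks if t.trust_name == trust_name and t.status is TaskStatus.PENDING),
        key=lambda t: t.id,
    )
    for task in claimed:
        task.status = TaskStatus.IN_PROGRESS
        task.started_at = now
    return claimed


def recover_stale(
    tasks: List[TrustTask],
    now: datetime,
    stale_after: timedelta = timedelta(minutes=30),
    max_retries: int = 3,
) -> int:
    """Reset stale IN_PROGRESS tasks to PENDING; fail them once retries are exhausted."""
    recovered = 0
    for task in tasks:
        if task.status is not TaskStatus.IN_PROGRESS or task.started_at is None:
            continue
        if now - task.started_at < stale_after:
            continue
        if task.retry_count >= max_retries:
            task.status = TaskStatus.FAILED  # poison task: stop re-queueing it
        else:
            task.retry_count += 1
            task.status = TaskStatus.PENDING
            task.started_at = None
            recovered += 1
    return recovered
```

In a PostgreSQL-backed version, claiming would need to be a single transaction so that two concurrent polls cannot receive the same task; the sketch ignores concurrency entirely.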
Trust-side (trust-api)
- Task poller: Background async loop polls hub, dispatches to handlers, reports results with retry/backoff
- Task handlers: One handler per task type, replacing the old inbound REST endpoints
- Removed: Inbound `/cohort` and `/imaging` router endpoints, TLS certificate generation scripts
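The "retry/backoff" used when reporting results is not specified in this issue. One common approach is exponential backoff with a cap, sketched below. The helper name `with_backoff`, the delay values, and the choice to retry only on `ConnectionError` are assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ConnectionError with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # Delays grow as base_delay * 1, 2, 4, ... up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

If every attempt fails, the task stays `IN_PROGRESS` on the hub, and the stale task recovery job is what eventually re-queues it.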
Security hardening
- Trust ownership check on result submission (403 if task doesn't belong to claiming trust)
- `max_length` validation on result payloads (10MB) to prevent database bloat
- Retry count on `TrustTask` to prevent poison tasks from looping indefinitely
- Safe parsing of imaging results with explicit field validation (prevents `KeyError`)
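A rough sketch of the checks listed above, written as plain functions rather than the actual endpoint code. The exception class, the 422 status for oversized payloads, the reading of 10MB as `10 * 1024 * 1024` characters, and the `import_status` field are assumptions; only the 403 on ownership mismatch and the `xnat_project_id` field name come from this issue.

```python
from typing import Any, Dict, Mapping

MAX_RESULT_LENGTH = 10 * 1024 * 1024  # assumed interpretation of "10MB"


class SubmissionError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def check_result_submission(task_trust: str, claiming_trust: str, raw_result: str) -> None:
    """Reject results from the wrong trust or with oversized payloads."""
    if task_trust != claiming_trust:
        raise SubmissionError(403, "task does not belong to this trust")
    if len(raw_result) > MAX_RESULT_LENGTH:
        raise SubmissionError(422, "result payload exceeds max_length")


def parse_imaging_result(result: Mapping[str, Any]) -> Dict[str, str]:
    """Validate fields explicitly instead of indexing and risking KeyError."""
    required = ("xnat_project_id", "import_status")  # import_status is illustrative
    missing = [key for key in required if key not in result]
    if missing:
        raise ValueError(f"imaging result missing fields: {', '.join(missing)}")
    for key in required:
        if not isinstance(result[key], str):
            raise ValueError(f"imaging result field {key!r} must be a string")
    return {key: result[key] for key in required}
```

Checking ownership before looking at the payload means a spoofed submission is rejected without the hub parsing any of its content.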
Imaging & UI fixes
- Filter `GET_IMAGING_STATUS` results by `xnat_project_id` to show correct import status
- Handle trusts without XNAT projects (optional schema fields)
- Fix UI error messages for queued (not yet created) imaging projects
- Accept 2xx range (not just 200) for trust status checks
Infrastructure cleanup
- Removed `TRUST_API_PORT` references from compose files, Terraform, and Ansible (port no longer exposed)
- Removed `TRUST_CA_BUNDLE` references and certificate volume mounts
- Removed `trust/certs/generate-trust-certs.sh`
- Simplified on-premises trust provisioning (Ansible)
- Added `flip-xnat-added-to-project` SES email template for existing users added to imaging projects
Testing
- 664 unit tests passing in `flip-api` (new tests for trust_tasks, stale_task_recovery, imaging_notifications, image_services)
- 34 unit tests passing in `trust-api` (new tests for task_poller, task_handlers)
- Covers: task dispatch, result submission, ownership verification, retry limits, post-processing, email notifications, stale recovery, trust-side polling and handler dispatch
Configuration
New environment variables:
| Variable | Default | Description |
|---|---|---|
| `TRUST_NAME` | (none) | Trust identity for polling (must match hub DB) |
| `POLL_INTERVAL_SECONDS` | `5` | How often the trust polls the hub |
| `TASK_STALE_TIMEOUT_MINUTES` | `30` | Time before an `IN_PROGRESS` task is considered stale |
| `TASK_MAX_RETRIES` | `3` | Max stale recovery retries before marking `FAILED` |
| `SCHEDULER_STALE_TASK_RECOVERY_RATE` | `10` | Minutes between stale task recovery runs |
| `TRUST_NAMES` | (none) | Allowlist of trust names to seed in hub DB |
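As an illustration of how these variables and their defaults might be read, here is a small sketch using only the standard library. The settings classes and loader functions are hypothetical, the split between trust-side and hub-side variables is inferred from the descriptions, and the comma-separated format for `TRUST_NAMES` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class TrustSettings:
    trust_name: str
    poll_interval_seconds: int


@dataclass(frozen=True)
class HubSettings:
    task_stale_timeout_minutes: int
    task_max_retries: int
    scheduler_stale_task_recovery_rate: int
    trust_names: Tuple[str, ...]


def load_trust_settings(env: Mapping[str, str] = os.environ) -> TrustSettings:
    """Read trust-side settings; TRUST_NAME has no default and is required."""
    trust_name = env.get("TRUST_NAME", "").strip()
    if not trust_name:
        raise ValueError("TRUST_NAME is required and must match the hub DB")
    return TrustSettings(
        trust_name=trust_name,
        poll_interval_seconds=int(env.get("POLL_INTERVAL_SECONDS", "5")),
    )


def load_hub_settings(env: Mapping[str, str] = os.environ) -> HubSettings:
    """Read hub-side settings, falling back to the documented defaults."""
    # Assumed format: comma-separated names, e.g. "trust_a,trust_b".
    names = tuple(n.strip() for n in env.get("TRUST_NAMES", "").split(",") if n.strip())
    return HubSettings(
        task_stale_timeout_minutes=int(env.get("TASK_STALE_TIMEOUT_MINUTES", "30")),
        task_max_retries=int(env.get("TASK_MAX_RETRIES", "3")),
        scheduler_stale_task_recovery_rate=int(env.get("SCHEDULER_STALE_TASK_RECOVERY_RATE", "10")),
        trust_names=names,
    )
```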
Migration notes
- The `trust_task` table is new and will be created by SQLModel on startup
- Trust services no longer need inbound firewall rules or TLS certificates
- On-premises trusts need `TRUST_NAME` and `CENTRAL_HUB_API_URL` configured
- The hub no longer needs trust endpoint URLs, because trusts self-register via polling