[Feature]: Refactor trust hub communication to outbound-only #220

@atriaybagur

Problem

The current hub-trust communication model requires the central hub to make inbound HTTP requests to trust APIs (for cohort queries, imaging project creation, health checks, etc.). This creates several problems:

  1. Firewall constraints: NHS trusts sit behind restrictive firewalls. Exposing trust API ports for inbound connections requires firewall rules and TLS certificate management, and it increases the attack surface.
  2. Certificate management burden: Each trust needs TLS certificates generated, distributed, and renewed (trust/certs/generate-trust-certs.sh). The hub must trust each certificate.
  3. Fragile connectivity: If the hub can't reach a trust (network blip, trust restart), the request fails immediately with no retry mechanism. Tasks are lost.
  4. Scaling difficulty: Adding a new trust requires configuring its endpoint URL, port, and certificates on the hub side, plus opening firewall rules.

Solution

Replace the inbound request model with a trust-initiated outbound polling architecture:

  • Trusts poll the hub for pending tasks over HTTPS (GET /tasks/{trust_name}/pending)
  • The hub queues tasks in a trust_task database table (PostgreSQL as FIFO buffer)
  • Trusts report results back to the hub (POST /tasks/{trust_name}/{task_id}/result)
  • Trusts send heartbeats to replace hub-initiated health checks
  • A scheduled job recovers tasks stuck in IN_PROGRESS (stale task recovery)

All communication is outbound from the trust: no inbound ports, no trust-side TLS certificates, and no inbound firewall rules are needed at the trust site.
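
To make the flow above concrete, here is a minimal in-memory sketch of the hub side. It is not the flip-api implementation: the class and method names (InMemoryHubQueue, claim_pending, submit_result), the COMPLETED status, and the 404 response are assumptions for illustration. A dictionary stands in for the trust_task table.

```python
import itertools
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class TaskStatus(str, Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"  # assumed name for the terminal success state


@dataclass
class Task:
    task_id: int
    trust_name: str
    task_type: str
    payload: dict
    status: TaskStatus = TaskStatus.PENDING
    result: Optional[dict] = None


class InMemoryHubQueue:
    """Hypothetical stand-in for the trust_task table."""

    def __init__(self) -> None:
        self._tasks: Dict[int, Task] = {}
        self._ids = itertools.count(1)

    def enqueue(self, trust_name: str, task_type: str, payload: dict) -> Task:
        task = Task(next(self._ids), trust_name, task_type, payload)
        self._tasks[task.task_id] = task
        return task

    def claim_pending(self, trust_name: str) -> List[Task]:
        """Roughly what GET /tasks/{trust_name}/pending does:
        return this trust's PENDING tasks in FIFO order and mark them IN_PROGRESS."""
        ordered = sorted(self._tasks.values(), key=lambda t: t.task_id)
        claimed = [
            t for t in ordered
            if t.trust_name == trust_name and t.status is TaskStatus.PENDING
        ]
        for task in claimed:
            task.status = TaskStatus.IN_PROGRESS
        return claimed

    def submit_result(self, trust_name: str, task_id: int, result: dict) -> int:
        """Roughly what POST /tasks/{trust_name}/{task_id}/result does.
        Returns an HTTP-style status code."""
        task = self._tasks.get(task_id)
        if task is None:
            return 404  # assumed behaviour for unknown task ids
        if task.trust_name != trust_name:
            return 403  # task belongs to a different trust
        task.result = result
        task.status = TaskStatus.COMPLETED
        return 200
```

In this sketch a second poll by the same trust returns nothing, because claimed tasks have already left the PENDING state.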

Changes

Core architecture (flip-api)

  • Task queue model: New TrustTask table with fields for task type, payload, status, result, retry count, and post-processing flag
  • Task dispatch endpoint: GET /tasks/{trust_name}/pending — returns pending tasks and marks them IN_PROGRESS
  • Result submission endpoint: POST /tasks/{trust_name}/{task_id}/result — with trust ownership verification to prevent cross-trust spoofing
  • Heartbeat endpoint: POST /trust/{trust_name}/heartbeat — replaces hub-initiated health checks
  • Stale task recovery: Scheduled job resets stuck IN_PROGRESS tasks back to PENDING, with a retry limit (TASK_MAX_RETRIES=3) to prevent poison task loops
  • Imaging post-processing: Credential emails and status persistence run after CREATE_IMAGING task completion, with automatic retry on failure
  • Task types: COHORT_QUERY, CREATE_IMAGING, DELETE_IMAGING, GET_IMAGING_STATUS, REIMPORT_STUDIES, UPDATE_USER_PROFILE
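
The stale task recovery rule could look roughly like the sketch below. The TaskRow fields and the function name are hypothetical, and the rule "fail once retry_count has reached TASK_MAX_RETRIES" is one reading of the retry limit, not necessarily the exact behaviour in flip-api.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List

TASK_STALE_TIMEOUT_MINUTES = 30  # default from the Configuration section
TASK_MAX_RETRIES = 3             # default from the Configuration section


@dataclass
class TaskRow:
    """Hypothetical subset of the TrustTask columns."""
    task_id: int
    status: str
    started_at: datetime
    retry_count: int = 0


def recover_stale_tasks(
    tasks: Iterable[TaskRow],
    now: datetime,
    stale_after: timedelta = timedelta(minutes=TASK_STALE_TIMEOUT_MINUTES),
    max_retries: int = TASK_MAX_RETRIES,
) -> List[int]:
    """Reset stuck IN_PROGRESS tasks to PENDING, or mark them FAILED once
    the retry budget is spent. Returns the ids of the tasks that changed."""
    touched = []
    for task in tasks:
        if task.status != "IN_PROGRESS":
            continue  # only in-flight tasks can be stale
        if now - task.started_at < stale_after:
            continue  # still within the timeout window
        if task.retry_count >= max_retries:
            task.status = "FAILED"  # poison task: stop re-queueing it
        else:
            task.retry_count += 1
            task.status = "PENDING"  # picked up again on the next poll
        touched.append(task.task_id)
    return touched
```

The scheduled job would call this every SCHEDULER_STALE_TASK_RECOVERY_RATE minutes against the real table instead of an in-memory list.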

Trust-side (trust-api)

  • Task poller: Background async loop polls hub, dispatches to handlers, reports results with retry/backoff
  • Task handlers: One handler per task type, replacing the old inbound REST endpoints
  • Removed: Inbound /cohort and /imaging router endpoints, TLS certificate generation scripts
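
A rough sketch of the trust-side loop, assuming the HTTP calls are injected as async callables. Here fetch_pending and report are placeholders for the real requests to the hub; the result shape ({"success": ..., "data"/"error": ...}) and the backoff parameters are also assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Handler = Callable[[dict], Awaitable[dict]]
FetchPending = Callable[[], Awaitable[List[dict]]]
Report = Callable[[int, dict], Awaitable[None]]


async def report_with_backoff(
    report: Report, task_id: int, result: dict,
    attempts: int = 3, base_delay: float = 0.5,
) -> bool:
    """Try to report a result, doubling the delay after each failed attempt.
    Returns True if the hub accepted the report."""
    for attempt in range(attempts):
        try:
            await report(task_id, result)
            return True
        except ConnectionError:
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    return False


async def poll_once(
    fetch_pending: FetchPending, report: Report,
    handlers: Dict[str, Handler], base_delay: float = 0.5,
) -> int:
    """Run one polling cycle. Returns the number of results reported."""
    reported = 0
    for task in await fetch_pending():
        handler = handlers.get(task["task_type"])
        if handler is None:
            result = {"success": False,
                      "error": f"unsupported task type {task['task_type']}"}
        else:
            try:
                result = {"success": True, "data": await handler(task["payload"])}
            except Exception as exc:  # a failing handler must not kill the loop
                result = {"success": False, "error": str(exc)}
        if await report_with_backoff(report, task["task_id"], result,
                                     base_delay=base_delay):
            reported += 1
    return reported


async def run_poller(
    fetch_pending: FetchPending, report: Report,
    handlers: Dict[str, Handler], interval_seconds: float = 5.0,
) -> None:
    """Poll forever, sleeping POLL_INTERVAL_SECONDS between cycles."""
    while True:
        try:
            await poll_once(fetch_pending, report, handlers)
        except ConnectionError:
            pass  # hub unreachable; try again on the next interval
        await asyncio.sleep(interval_seconds)
```

If reporting still fails after the last attempt, the task stays IN_PROGRESS on the hub and is eventually picked up by stale task recovery.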

Security hardening

  • Trust ownership check on result submission (403 if task doesn't belong to claiming trust)
  • max_length validation on result payloads (10MB) to prevent database bloat
  • Retry count on TrustTask to prevent poison tasks from looping indefinitely
  • Safe parsing of imaging results with explicit field validation (prevents KeyError)
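
As an illustration of the size limit and the safe parsing, here is a stdlib-only sketch. The field names (status, xnat_project_id as optional), the InvalidResult exception, and the reading of 10MB as 10,000,000 characters are assumptions; the real validation presumably lives on the request schema.

```python
import json
from typing import Any, Dict

MAX_RESULT_LENGTH = 10_000_000  # "10MB" read as a character limit (assumption)
REQUIRED_FIELDS = ("status",)            # hypothetical required field
OPTIONAL_FIELDS = ("xnat_project_id",)   # trusts without XNAT projects omit this


class InvalidResult(ValueError):
    """Raised instead of letting a KeyError escape from result handling."""


def parse_imaging_result(raw: str) -> Dict[str, Any]:
    """Validate a raw result payload and return only the known fields."""
    if len(raw) > MAX_RESULT_LENGTH:
        raise InvalidResult("result payload exceeds the size limit")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidResult(f"result is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidResult("result must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise InvalidResult(f"missing required fields: {', '.join(missing)}")
    parsed = {name: data[name] for name in REQUIRED_FIELDS}
    for name in OPTIONAL_FIELDS:
        parsed[name] = data.get(name)  # None when the trust did not send it
    return parsed
```

Rejecting oversized payloads before parsing keeps a misbehaving trust from forcing the hub to decode and store arbitrarily large results.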

Imaging & UI fixes

  • Filter GET_IMAGING_STATUS results by xnat_project_id to show correct import status
  • Handle trusts without XNAT projects (optional schema fields)
  • Fix UI error messages for queued (not yet created) imaging projects
  • Accept 2xx range (not just 200) for trust status checks

Infrastructure cleanup

  • Removed TRUST_API_PORT references from compose files, Terraform, and Ansible (port no longer exposed)
  • Removed TRUST_CA_BUNDLE references and certificate volume mounts
  • Removed trust/certs/generate-trust-certs.sh
  • Simplified on-premises trust provisioning (Ansible)
  • Added flip-xnat-added-to-project SES email template for existing users added to imaging projects

Testing

  • 664 unit tests passing in flip-api (new tests for trust_tasks, stale_task_recovery, imaging_notifications, image_services)
  • 34 unit tests passing in trust-api (new tests for task_poller, task_handlers)
  • Covers: task dispatch, result submission, ownership verification, retry limits, post-processing, email notifications, stale recovery, trust-side polling and handler dispatch

Configuration

New environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| TRUST_NAME | (none) | Trust identity for polling (must match hub DB) |
| POLL_INTERVAL_SECONDS | 5 | How often the trust polls the hub |
| TASK_STALE_TIMEOUT_MINUTES | 30 | Time before an IN_PROGRESS task is considered stale |
| TASK_MAX_RETRIES | 3 | Max stale recovery retries before marking FAILED |
| SCHEDULER_STALE_TASK_RECOVERY_RATE | 10 | Minutes between stale task recovery runs |
| TRUST_NAMES | (none) | Allowlist of trust names to seed in hub DB |
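
For reference, reading these variables with their defaults might look like the sketch below. The load_settings helper is hypothetical, and treating TRUST_NAMES as a comma-separated list is an assumption; the issue does not state the format.

```python
import os
from typing import Any, Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Read the new variables, falling back to the documented defaults."""
    # Assumption: TRUST_NAMES is a comma-separated list of trust names.
    trust_names = [
        name.strip()
        for name in env.get("TRUST_NAMES", "").split(",")
        if name.strip()
    ]
    return {
        "TRUST_NAME": env.get("TRUST_NAME") or None,  # no default
        "POLL_INTERVAL_SECONDS": int(env.get("POLL_INTERVAL_SECONDS", "5")),
        "TASK_STALE_TIMEOUT_MINUTES": int(env.get("TASK_STALE_TIMEOUT_MINUTES", "30")),
        "TASK_MAX_RETRIES": int(env.get("TASK_MAX_RETRIES", "3")),
        "SCHEDULER_STALE_TASK_RECOVERY_RATE": int(
            env.get("SCHEDULER_STALE_TASK_RECOVERY_RATE", "10")
        ),
        "TRUST_NAMES": trust_names,  # empty list when unset
    }
```

Passing a plain dict instead of os.environ makes the helper easy to exercise in unit tests.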

Migration notes

  • The trust_task table is new and will be created by SQLModel on startup
  • Trust services no longer need inbound firewall rules or TLS certificates
  • On-premises trusts need TRUST_NAME and CENTRAL_HUB_API_URL configured
  • The hub no longer needs trust endpoint URLs — trusts self-register via polling

Labels: enhancement (New feature or request)
