From e9078efa7f0522175283c5d0bdc0a2e0d4b0834f Mon Sep 17 00:00:00 2001
From: Andrew Miller
Date: Wed, 22 Apr 2026 17:35:19 -0400
Subject: [PATCH 1/2] docs: add SECURITY.md; link from README

Covers the dstack LUKS2 disk-encryption model, HKDF key derivation tied to
app_id + instance_id, what lives in each SQLite table, token lifecycle, and
Stage 0 limitations. README now points readers there and corrects the
env-template filename.

Co-Authored-By: Claude Opus 4.7
---
 README.md   |   6 ++-
 SECURITY.md | 141 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 SECURITY.md

diff --git a/README.md b/README.md
index 4ea780f..59b77b5 100644
--- a/README.md
+++ b/README.md
@@ -124,9 +124,13 @@ proxy/src/
 └── scoped-fetch.ts # URL globs, methods, body schema, rate limits
 dstack/
 ├── docker-compose.yml # CVM deployment
-└── .env.staging # Env template
+└── .env.production # Env template (secrets in gitignored .env.staging)
 ```
 
+## Security
+
+See [SECURITY.md](SECURITY.md) for the full security model: where secrets are stored, how volumes are encrypted (LUKS2 with TEE-derived keys), authentication token lifecycle, data loss scenarios, and known Stage 0 limitations.
+
 ## License
 
 MIT
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 0000000..e8ab4b4
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,141 @@
+# Security Model
+
+OAuth3 Enclave runs inside a [dstack](https://docs.phala.network/dstack/overview) Confidential VM (CVM) on Intel TDX hardware. This document explains where secrets are stored, how they're protected, and under what conditions they can be lost.
+
+## Data at rest
+
+All Docker volumes inside the CVM sit on a **LUKS2-encrypted disk**:
+
+- **Cipher:** AES-XTS-plain64, 512-bit key
+- **KDF:** PBKDF2 with SHA256
+- **Filesystem:** ZFS on the encrypted device
+- **Key source:** derived from dstack KMS, tied to TDX attestation measurements
+
+The disk encryption key (`disk_crypt_key`) is provisioned automatically at CVM boot. The KMS only releases the key after validating the TDX attestation quote — proving the correct code is running on genuine TEE hardware. The operator never sees this key.
+
+The key is derived deterministically via HKDF-SHA256:
+```
+disk_crypt_key = HKDF(KMS_root_key, [app_id || instance_id || "app-disk-crypt-key"])
+```
+
+The `instance_id` includes a random seed generated once per CVM instance and persisted locally. This means:
+
+- **Same CVM restarting or upgrading:** same seed → same key → data survives
+- **Different CVM with same app_id:** different seed → different key → **cannot read the other's data**
+- **No key migration or export:** there is no mechanism to share disk keys between instances
+
+This is single-instance durability, not replicated storage. User secrets are bound to the specific CVM instance.
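
A minimal sketch of the derivation above, using Node's built-in `crypto.hkdfSync`. This is not the dstack KMS code: the helper name `deriveDiskKey`, the plain byte concatenation of the context fields, the empty salt, and the 64-byte output (chosen to match the 512-bit AES-XTS key) are all assumptions made for illustration. It shows the property the bullets describe: the same `app_id` and `instance_id` always give the same key, while a different `instance_id` gives an unrelated one.

```js
const crypto = require('node:crypto');

// Hypothetical helper, NOT the dstack KMS implementation.
// Assumption: the context fields are concatenated as UTF-8 bytes and used
// as the HKDF "info" input; the real encoding may differ.
function deriveDiskKey(rootKey, appId, instanceId) {
  const info = Buffer.concat([
    Buffer.from(appId, 'utf8'),
    Buffer.from(instanceId, 'utf8'),
    Buffer.from('app-disk-crypt-key', 'utf8'),
  ]);
  // Assumption: empty salt, 64-byte output (512 bits for AES-XTS).
  const okm = crypto.hkdfSync('sha256', rootKey, Buffer.alloc(0), info, 64);
  return Buffer.from(okm);
}

// Stand-in for the KMS root key; in the real system it never leaves the KMS.
const rootKey = crypto.randomBytes(32);

const a1 = deriveDiskKey(rootKey, 'app-1', 'instance-a');
const a2 = deriveDiskKey(rootKey, 'app-1', 'instance-a'); // same CVM after a restart
const b = deriveDiskKey(rootKey, 'app-1', 'instance-b');  // different CVM, same app_id

console.log('same instance, same key:', a1.equals(a2));
console.log('different instance, same key:', a1.equals(b));
```

Because the derivation is deterministic, nothing about the key has to be stored on the disk it protects. The only per-instance state is the seed inside `instance_id`.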
+
+**Reference:** `dstack/dstack-util/src/system_setup.rs` (`luks_setup()`, `mount_data_disk()`), `dstack/kms/src/main_service.rs` (key derivation)
+
+## What is stored where
+
+### SQLite (`/data/proxy.db` — encrypted volume `oauth3-data`)
+
+| Table | Contents | Sensitivity |
+|---|---|---|
+| `secrets` | User-submitted credentials (API keys, cookies), scoped by `owner_id` | **High** — plaintext within the app, encrypted at rest by LUKS2 |
+| `sessions` | Session state, bearer tokens, expiry, policies | **High** — bearer tokens grant invoke access |
+| `capabilities` | Capability code, signatures, spec hashes | Medium — defines what each permit can do |
+| `kv_store` | Persistent key-value state for custom capabilities | Medium — application data |
+| `execution_requests` | Pending/completed execution records | Low |
+| `scope_grants` | Approved scope grants per session | Low |
+
+### Postgres (`pgdata` volume — encrypted by same LUKS2 disk)
+
+Stores **audit logs only** — execution history, session metadata, scope grant records. No secrets. Writes are fire-and-forget and never block the main flow.
+
+### Environment variables (set at deploy time)
+
+| Variable | Purpose | Who can see it |
+|---|---|---|
+| `JWT_SECRET` | HMAC-SHA256 key for signing/verifying JWTs | Operator (deploy-time env var) |
+| `PG_PASSWORD` | Internal Postgres password (container-to-container) | Operator (deploy-time env var) |
+| `ANTHROPIC_API_KEY` | For LLM-drafted capability specs | Operator (deploy-time env var) |
+| `CLOUDFLARE_API_TOKEN` | TLS certificate provisioning via DNS-01 | Operator (deploy-time env var) |
+
+## Authentication tokens
+
+| Token | Issued by | Lifetime | Purpose |
+|---|---|---|---|
+| Owner JWT | Orchestrator or enclave | 1 year | Manage secrets, approve permits |
+| Agent JWT | Orchestrator or enclave | 24 hours | Request permits, execute code |
+| Bearer token | Enclave (at approval) | 1 year (custom) / 30 min (standard) | Direct capability invocation via `/invoke` |
+| API key | Orchestrator | Indefinite | Tenant identity for orchestrator endpoints |
+| Magic link | Orchestrator | 15 minutes | Email-based login |
+
+Callers of `/invoke` use bearer tokens — no JWT knowledge required.
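
As an illustration of the bearer-token row in the table, the sketch below issues an opaque random token and checks it on a later call. The names `issueBearer` and `bearerIsValid`, the in-memory record shape, and the 32-byte (64 hex character) token length are assumptions for this example, not the enclave's actual code. Only the lifetimes (1 year for custom, 30 minutes for standard) come from the table.

```js
const crypto = require('node:crypto');

// Lifetimes taken from the token table; everything else is hypothetical.
const LIFETIME_MS = {
  custom: 365 * 24 * 60 * 60 * 1000, // 1 year
  standard: 30 * 60 * 1000,          // 30 minutes
};

// Issue an opaque bearer: 32 random bytes rendered as 64 hex characters.
function issueBearer(kind, now = Date.now()) {
  return {
    token: crypto.randomBytes(32).toString('hex'),
    expiresAt: now + LIFETIME_MS[kind],
  };
}

// Check a presented token against a stored record.
// timingSafeEqual throws on unequal lengths, so compare lengths first.
function bearerIsValid(presented, record, now = Date.now()) {
  const a = Buffer.from(String(presented), 'utf8');
  const b = Buffer.from(record.token, 'utf8');
  if (a.length !== b.length) return false;
  return crypto.timingSafeEqual(a, b) && now < record.expiresAt;
}
```

The caller needs nothing beyond the token string, which is the sense in which `/invoke` requires no JWT knowledge.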
+
+## Data loss scenarios
+
+| Event | User secrets | Audit logs | Recovery |
+|---|---|---|---|
+| `phala deploy` (upgrade) | **Survives** — volumes persist | **Survives** | None needed |
+| Container crash + restart | **Survives** — Docker restart policy | **Survives** | Automatic |
+| `phala cvms create` (new CVM) | **Lost** — new volumes | **Lost** | Must recreate permits and re-submit secrets |
+| CVM host migration | **Lost** — volumes don't migrate | **Lost** | Must recreate |
+
+## Trust boundaries
+
+```
+┌─────────────────────────────────────────────────┐
+│  CVM (Intel TDX)                                │
+│                                                 │
+│  ┌─────────────┐  ┌──────────┐  ┌───────────┐   │
+│  │ oauth3-proxy│  │ postgres │  │ browser   │   │
+│  │ (secrets,   │  │ (audit   │  │ (Playwright│  │
+│  │  sessions,  │  │  logs)   │  │  + VPN)   │   │
+│  │  execution) │  │          │  │           │   │
+│  └──────┬──────┘  └──────────┘  └───────────┘   │
+│         │                                       │
+│  ┌──────┴──────────────────────────────────┐    │
+│  │  LUKS2-encrypted ZFS volume             │    │
+│  │  Key: disk_crypt_key from KMS           │    │
+│  │  (tied to TDX attestation)              │    │
+│  └─────────────────────────────────────────┘    │
+│                                                 │
+│  dstack-ingress (attested TLS)                  │
+└─────────────────────────────────────────────────┘
+         │
+    TLS (Cloudflare DNS-01 cert)
+         │
+┌────────┴────────┐     ┌──────────────────┐
+│  Orchestrator   │     │  Agent / Browser │
+│  (Vercel)       │     │  (untrusted)     │
+│  - signup       │     │                  │
+│  - rate limits  │     │                  │
+│  - dashboard    │     │                  │
+│  NEVER sees     │     │                  │
+│  user secrets   │     │                  │
+└─────────────────┘     └──────────────────┘
+```
+
+**The orchestrator is untrusted.** It handles tenant signup, rate limiting, and serves the approval UI — but all secret submission and capability execution goes directly to the enclave. The orchestrator never sees user secrets.
+
+**The operator** controls what code runs in the CVM (via docker-compose) and can set environment variables. See "Known limitations" below.
+
+## Known limitations (Stage 0)
+
+These are gaps that must be closed for [Stage 1 (Dev-Proof)](https://draftv4.erc733.org):
+
+1. **`JWT_SECRET` is operator-supplied** — the operator could issue owner tokens for any tenant. Should be derived from `DeriveKey("/app/jwt-secret")` so it's TEE-bound.
+
+2. **No on-chain transparency log** — no public record of what code version was deployed when. Should use Base KMS with DEPLOYMENTS.md tracking compose hashes and on-chain TX links.
+
+3. **Docker images not pinned by digest in CI** — images are pinned in docker-compose.yml by `@sha256:`, but builds are not reproducible (no `SOURCE_DATE_EPOCH`, base images not pinned).
+
+4. **Configurable URLs** — `ORCHESTRATOR_URL`, `PUBLIC_URL`, `CORS_ORIGIN` are operator-controlled env vars. A malicious operator could point these at attacker infrastructure.
+
+5. **Dev fallback in auth.ts** — when `JWT_SECRET` is empty, auth is disabled entirely (open access). This should not be possible in production.
+
+## Attestation
+
+The CVM's TDX attestation quote can be fetched from the metadata endpoint:
+
+```bash
+curl https://23da7533b60fe6e5f5e30c97f30af5bd7ccdf4df-8090.dstack-pha-prod9.phala.network/
+```
+
+This returns `tcb_info` including the `compose_hash` — a SHA256 of the app configuration (docker-compose + allowed env vars). Third parties can verify this against the source code without needing account access.
+
+Visual verification: https://trust.phala.com/app/23da7533b60fe6e5f5e30c97f30af5bd7ccdf4df

From e85e23aaeab3d775a8cdce1e066e336c5d43476f Mon Sep 17 00:00:00 2001
From: Andrew Miller
Date: Wed, 22 Apr 2026 17:31:31 -0400
Subject: [PATCH 2/2] docs: add persistent-demo walkthrough for yt-shorts-v3

Case study of why one permit stayed live for weeks while 41 siblings went
stale. Covers the five decoupled ingredients (immutable capability code,
year-long bearer, out-of-band cookie refresh, static GitHub Pages UI,
untouched enclave), the DB evidence for the cookie-refresh correlation, a
clone-this-pattern recipe, and the sidecar SSH debug checklist.
Flags the unbuilt distribution path for the companion Chrome extension as
an open question.

Co-Authored-By: Claude Opus 4.7
---
 docs/persistent-demo-walkthrough.md | 155 ++++++++++++++++++++++++++++
 1 file changed, 155 insertions(+)
 create mode 100644 docs/persistent-demo-walkthrough.md

diff --git a/docs/persistent-demo-walkthrough.md b/docs/persistent-demo-walkthrough.md
new file mode 100644
index 0000000..1ed5d24
--- /dev/null
+++ b/docs/persistent-demo-walkthrough.md
@@ -0,0 +1,155 @@
+# Anatomy of a Persistent Demo
+
+A case study of one permit (`yt-shorts-v3`) that stayed live and logged-in for weeks while 41 sibling permits on the same enclave went stale within days. This doc reverse-engineers why, and turns the answer into a recipe.
+
+## The live demo
+
+A static HTML page on GitHub Pages calls a custom capability on the enclave every five minutes:
+
+```
+https://account-link.github.io/oauth3-extension-page/query.html#p=yt-shorts-v3&t=
+```
+
+The page asks the enclave a single question — "Is this user watching YouTube Shorts right now?" — and renders the answer alongside a live list of recent shorts with titles. No backend, no auth state on the page itself, no session to refresh. The full client is 108 lines of vanilla JS.
+
+Under the hood, the capability hits YouTube's `/feed/history` via a headless Playwright browser running inside the TEE (see [browser/README.md](../browser/README.md)), authenticating with cookies the enclave holds as a secret. It diffs the shorts count against a per-permit KV store, returns `{watching, shortsCount, videosToday, newShorts, shorts[]}`, and leaves.
+
+## Why it survives
+
+Five decoupled ingredients. Drop any one and the demo dies in days rather than weeks.
+
+### 1. The permit's capability code is immutable
+
+The `/permit` endpoint ([proxy/src/server.ts:454](../proxy/src/server.ts)) stores the capability's JavaScript body as a row in SQLite. You cannot update code on an existing permit — you can expand capabilities, but the executor picks the first match, so the only way to change behavior is to create a new `permit_id`. That constraint turns out to be a feature: the thing agents call is pinned. Nothing downstream can drift it.
+
+### 2. The bearer token outlives everything else
+
+Permits created by the custom-plugin flow get an `expires_at` one year out ([proxy/src/database.ts](../proxy/src/database.ts), schema migration). The bearer is 64 hex characters of cryptographic randomness generated at approval time and returned once. `/invoke/:permit_id` authenticates by bearer only — no JWT, no refresh, no session state to rotate. The URL hash `#p=&t=` is the entire credential.
+
+### 3. Cookies refresh out-of-band — the real answer
+
+This is the ingredient that distinguishes working demos from dead ones. A companion Chrome extension (source: `Account-Link/oauth3-extension-1`) maintains a live cookie-sync loop for every tracked domain:
+
+- A `chrome.alarms` alarm named `cookie-sync` fires every 30 minutes.
+- A `chrome.cookies.onChanged` listener fires whenever any cookie for a tracked domain mutates (debounced 500 ms).
+- Both paths call `uploadCookies(domain)` which POSTs the current cookies to `/cookies/upload` on the enclave.
+
+The `/cookies/upload` endpoint ([proxy/src/server.ts:150-157](../proxy/src/server.ts)) stores the payload as secret `COOKIES_` scoped to the owner. Each `/invoke` call resolves secrets fresh from the DB before building the capability's endowment ([proxy/src/server.ts:704-721](../proxy/src/server.ts)), so the JS body reads the newest cookies on every call:
+
+```js
+const raw = JSON.parse(secrets.COOKIES_YOUTUBE_COM);
+const cookies = raw.cookies; // /cookies/upload wraps as {cookies, user_agent}
+```
+
+YouTube's auth cookies (`SIDCC`, `__Secure-1PSIDTS`, `__Secure-3PSIDTS`) rotate on every authenticated request. Without continuous refresh they drift out of validity in days. With it, the enclave sees the exact same set Chrome is using right now.
+
+#### Caveat: the extension only runs in developer mode
+
+The author of the working demo is also the author of the extension, loaded unpacked from source into one Chrome profile on one laptop. **There is no production install path today.** Specifically:
+
+- The extension has not been submitted to the Chrome Web Store. It has no publisher verification, no signed `.crx` distribution, no auto-update channel.
+- The `manifest.json` declares `host_permissions` only for `https://tee.oauth3-stage.monerolink.com/*`, plus `optional_host_permissions: [""]` that the user has to grant per-domain at runtime.
+- To reproduce the demo someone else needs to clone `Account-Link/oauth3-extension-1`, open `chrome://extensions`, enable Developer Mode, and "Load unpacked" against the cloned directory. That is the current install documentation.
+- Unpacked extensions get disabled on every Chrome update in some managed environments, and their background service worker can be killed more aggressively than a Web Store extension. Treat the observed "weeks of uptime" as an existence proof on one machine, not a general guarantee.
+
+The production path — Web Store submission, signed builds, an auto-update URL, and an install flow users can follow without a GitHub account — is unbuilt. The working demo is an N=1 of what happens when the author personally keeps the loop running. That is still informative: it confirms that **given** a healthy refresh loop, the rest of the architecture holds up for weeks. The open question is how to make the refresh loop itself reliably available to anyone else.
+
+### 4. GitHub Pages as the UI host
+
+`query.html` is a static file in a separate public repo (`Account-Link/oauth3-extension-page`). All state lives in the URL hash fragment, which never leaves the browser — GitHub's servers see no token. Nothing to deploy, nothing to expire, no CORS surprises. The page is the permit's own mini-frontend and it costs zero operational overhead.
+
+### 5. The enclave itself hasn't been touched
+
+The Phala CVM running `oauth3-proxy-staging` has a 26-day uptime at the time of writing, with six containers all `Up 3 weeks`. Volumes persist across redeploys (see [dstack/DEPLOY.md](../dstack/DEPLOY.md)) so DB state would survive a restart anyway. The stable substrate under the whole thing is just... not moving.
+
+## Evidence: the mortality correlation
+
+The `secrets` table has an `updated_at` column that bumps on every `/cookies/upload`. A snapshot taken directly from the enclave DB tells the story in one table.
+
+| Owner | `COOKIES_YOUTUBE_COM` age | Demo status |
+|---|---|---|
+| the working demo's owner | 27 minutes | live |
+| (and `COOKIES_GITHUB_COM`) | 3 minutes | live |
+| next-freshest tenant | ~19 days | dead |
+| 3 tenants | ~26–30 days | dead |
+| 36 more tenants | 36–42 days | dead |
+
+The freshest non-working tenant is almost three weeks behind. Everyone else froze the day they last opened Chrome with the extension active. The 27-minute age on the working owner falls within one 30-minute `cookie-sync` alarm period, and the 3-minute GitHub entry is a `chrome.cookies.onChanged` event that fired when they browsed GitHub earlier in the hour.
+
+This is the full causal chain, empirically:
+
+```
+Chrome rotates cookies on every YouTube request
+        │
+        ▼
+  onChanged listener (debounced) ─┐
+                                  ├──► uploadCookies() ──► POST /cookies/upload ──► secrets.updated_at
+  cookie-sync alarm (30 min) ─────┘                                                       │
+                                                                                          ▼
+                                                                  /invoke re-reads secrets on every call
+                                                                                          │
+                                                                                          ▼
+                                                                     capability sees current cookies
+                                                                                          │
+                                                                                          ▼
+                                                                   YouTube treats request as logged-in
+```
+
+Remove any step and the chain breaks. What actually broke for every dead demo in the table was the top step — the user stopped running the extension, so nothing uploaded, so cookies froze at whatever timestamp the last sync wrote.
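
The two trigger paths at the top of the chain can be sketched as follows. This is a hypothetical reconstruction, not the extension's source: `registerCookieSync`, its parameters, and the domain-matching rule are assumptions. The `chrome.alarms` and `chrome.cookies.onChanged` calls are the standard extension APIs, passed in as `chromeApi` so the function can also be exercised outside a browser.

```js
const SYNC_ALARM = 'cookie-sync';

// Hypothetical wiring of the alarm path and the onChanged path.
// `uploadCookies(domain)` is assumed to POST to /cookies/upload.
function registerCookieSync(chromeApi, trackedDomains, uploadCookies, debounceMs = 500) {
  const pending = new Map(); // domain -> timer for the debounced upload

  // Path 1: periodic alarm, every 30 minutes.
  chromeApi.alarms.create(SYNC_ALARM, { periodInMinutes: 30 });
  chromeApi.alarms.onAlarm.addListener((alarm) => {
    if (alarm.name !== SYNC_ALARM) return;
    for (const domain of trackedDomains) uploadCookies(domain);
  });

  // Path 2: any cookie change on a tracked domain, debounced so a burst
  // of rotations produces a single upload.
  chromeApi.cookies.onChanged.addListener((changeInfo) => {
    const cookieDomain = changeInfo.cookie.domain.replace(/^\./, '');
    const domain = trackedDomains.find(
      (d) => cookieDomain === d || cookieDomain.endsWith('.' + d)
    );
    if (!domain) return;
    clearTimeout(pending.get(domain));
    pending.set(domain, setTimeout(() => {
      pending.delete(domain);
      uploadCookies(domain);
    }, debounceMs));
  });
}
```

Inside an extension service worker this would be called once at startup, for example `registerCookieSync(chrome, ['youtube.com'], uploadCookies)`. Either path alone keeps `secrets.updated_at` moving; the alarm covers idle periods and the listener covers active browsing.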
+
+## Recipe: clone this pattern
+
+To build a similar long-lived demo for a different site:
+
+1. **Create a custom-plugin permit** with a capability that reads `secrets.COOKIES_` and hands cookies to the browser service. Template in [proxy/src/plugins/custom.ts](../proxy/src/plugins/custom.ts); working example in `yt-testing/setup_short_check.sh` (at project root) which extracts Chrome cookies, creates the permit, approves it, and prints the query URL. Capture the bearer token.
+
+2. **Install the extension in developer mode** from `Account-Link/oauth3-extension-1`: clone the repo, open `chrome://extensions`, enable Developer Mode, click "Load unpacked" and point it at the clone. Then open the extension's popup, add the target domain to tracked sites, and flip the `syncEnabled` toggle. The extension stores a `trackedSites` list in `chrome.storage.local`; with `syncEnabled: true` on a domain, it calls `/cookies/upload` every 30 minutes and on every cookie change for that domain. See the caveat above — there is no Web Store distribution yet, and the author's single dev-mode install is the only place this loop is currently known to run for weeks.
+
+3. **Write a static HTML page** that calls `POST /invoke/` with the bearer. Host it wherever — GitHub Pages works well because it's free, CORS-friendly, and invisible to the end user. Pattern: `const { result } = await fetch(...)` in a `setInterval` loop. `query.html` in the extension-page repo is the reference implementation (108 lines).
+
+4. **Verify the loop is running.** Open the extension's service worker console and check `syncHealth.lastCookieSync` and `trackedSites[].lastUpload` in `chrome.storage.local`. Both should tick forward within 30 minutes of browsing activity.
+
+## Debugging checklist — is the refresh actually happening?
+
+The extension-side signals above are useful but the authoritative answer lives in the enclave DB. SSH in through the CVM's sidecar container:
+
+```bash
+cat <<'EOF' | phala ssh -- docker exec -i -w /app dstack-oauth3-proxy-1 node
+const db = require('better-sqlite3')('/data/proxy.db', { readonly: true });
+const now = Date.now();
+const rows = db.prepare(
+  "SELECT name, owner_id, updated_at FROM secrets WHERE name LIKE 'COOKIES_%' ORDER BY updated_at DESC"
+).all();
+console.log(JSON.stringify(rows.map(r => ({
+  name: r.name,
+  owner: r.owner_id.slice(0, 8),
+  age_min: Math.round((now - r.updated_at) / 60000)
+}))));
+EOF
+```
+
+Notes on this incantation:
+
+- `phala ssh` connects you to the `dstack-ssh-1` sidecar, which has access to `/var/run/docker.sock`. The proxy container's `/data` volume is not visible from the sidecar itself — you have to `docker exec` in.
+- The proxy image is distroless-ish, but `node` is at `/usr/local/bin/node` and `better-sqlite3` is already installed. No need to install tools.
+- The `phala ssh` argv parser strips quote characters, which makes multi-layer shell quoting painful. Feeding the script on stdin through a heredoc (as above) sidesteps the problem entirely.
+- What you want to see: your tenant's `age_min` value bouncing around 0–30 on every query. If it grows past 30 and keeps growing, the extension isn't calling `/cookies/upload` anymore, and you have a day or two before the demo dies.
+
+If the refresh is healthy and the demo still fails, the failure is elsewhere: the cookies themselves may have been invalidated server-side (YouTube "Sign out from all other sessions" etc), or the capability may be hitting a page-structure change (YouTube renames the fields it parses periodically — see [yt_capabilities_2026_03_12.md](../yt_capabilities_2026_03_12.md) for past examples).
+
+## Generalization
+
+The recipe isn't about YouTube or about shorts. It's a pattern for long-lived cookie-authenticated agent capabilities:
+
+- **Secrets that can go stale need a refresh channel separate from the permit itself.** The permit is a pinned, signed, approved thing; the cookies are live material. Conflating the two means every cookie rotation would require re-approval.
+- **An out-of-band agent the user already runs is the simplest refresh source.** A browser extension has privileged access to live cookies; uploading them to the enclave on a timer is trivially cheap.
+- **One year is about right as a bearer TTL for custom capabilities.** Shorter and you'll see demos die for trivial reasons. Longer and you're asking the enclave to keep a lot of old sessions around.
+- **The UI surface can be static.** Any page that accepts a bearer in the URL hash can drive the permit without its own backend.
+
+The enclave's job in this picture is narrow but essential: hold the bearer, hold the code, hold the cookies, and re-assemble them on every call. The rest is just making sure the cookies stay fresh.
+
+## Open questions
+
+- **Distribution of the refresh agent.** The extension works beautifully on the author's machine. Getting it onto other users' machines in a durable way means Web Store submission, signed builds, an auto-update URL, and an onboarding flow — none of which exists yet. This is the gating problem between "an interesting N=1" and "a demo anyone can stand up."
+- **Fallbacks for sites that don't play well with refreshed cookies.** Some platforms invalidate sessions if they see cookies arriving from unexpected contexts. YouTube and TikTok have tolerated the pattern; not every site will.
+- **Cookie-refresh failure detection.** Today the capability throws "Not logged in — cookies may be expired" and the UI surfaces an error. A healthier loop would detect staleness proactively and notify the owner (via push, email, or the extension itself) before the next demo call fails.
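
One possible shape for the proactive staleness check mentioned in the last bullet, as a rough sketch. The function name `classifyFreshness` and the thresholds (late after one 30-minute sync period, stale after two) are assumptions. The row fields mirror the `secrets` columns queried in the debugging checklist, with `updated_at` taken to be epoch milliseconds as in that script.

```js
const SYNC_PERIOD_MIN = 30; // the extension's cookie-sync alarm period

// Hypothetical classifier: label each cookie secret by how far its last
// upload lags behind the expected sync cadence.
function classifyFreshness(rows, now = Date.now(), periodMin = SYNC_PERIOD_MIN) {
  return rows.map((r) => {
    const ageMin = Math.round((now - r.updated_at) / 60000);
    let status = 'fresh';
    if (ageMin > 2 * periodMin) status = 'stale';   // missed at least two syncs
    else if (ageMin > periodMin) status = 'late';   // missed one sync
    return { name: r.name, owner: r.owner_id.slice(0, 8), ageMin, status };
  });
}
```

A notifier could run this on a timer and alert the owner for any row that reaches `late`, well before YouTube starts rejecting the cookies. The notification channel itself is left open here, as it is in the bullet above.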