docs(security): land RLS-lockdown / storage / realtime-(a) remediation plans (#231)

Jose-Gael-Cruz-Lopez · Jose-Gael-Cruz-Lopez · commit b248798abf34 · 2026-06-13T22:20:46.000-04:00
Carved from the #232 draft so the planning docs live in main (the SQL stays in #232 as the applied-to-prod record, not merged). The RLS lockdown is already applied to production (anon locked out, confirmed). #230 references the realtime-jwt-bridge design doc landed here.
diff --git a/docs/security/realtime-jwt-bridge-design.md b/docs/security/realtime-jwt-bridge-design.md
@@ -0,0 +1,83 @@
+# Realtime option (a) — JWT-mint bridge (design, #231)
+
+**Status: DESIGN for review, not built.** The RLS lockdown is already applied;
+this is the piece that restores `room_messages` realtime under RLS.
+
+## Goal
+Keep live chat working once RLS is enabled, **without** the public anon key:
+the realtime client authenticates as the logged-in user via a Supabase JWT, and
+an RLS policy scopes delivery to the user's room memberships.
+
+## Why a JWT is needed
+Sapling uses its **own HMAC session** (`sapling_session`), not Supabase Auth, so
+`auth.uid()` returns NULL and RLS can't identify the user. We bridge by minting a
+Supabase-format JWT for the same user and handing it to the realtime client.
+
+## Components
+
+### 1. Mint a Supabase JWT (backend)
+At login (and on refresh), the backend mints a short-lived JWT signed with the
+**Supabase JWT secret** (legacy HS256; the same secret that signs the anon key):
+```
+claims: { sub: <user_id>, role: "authenticated", aud: "authenticated", exp: now+1h, iat: now }
+```
+- New endpoint, e.g. `GET /api/auth/realtime-token` (auth-gated by the existing
+  session): returns `{ token, expires_at }`, minted from the still-valid 30-day
+  session.
+- New env `SUPABASE_JWT_SECRET` (from Supabase → Settings → API → JWT secret).
+  Add it to `validate_config()` (#174) as required outside local.
+- Note: Supabase is migrating to **asymmetric signing keys**. If this project is
+  on the new keys, mint/verify with the project's signing key instead of an
+  HS256 shared secret — confirm which at build time.
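+
+A minimal sketch of the mint step, assuming the legacy HS256 path and using only
+the Python standard library so it runs anywhere. `mint_realtime_jwt` and its
+parameters are hypothetical names, not existing Sapling code, and the secret is
+assumed to be used as a raw UTF-8 string. A real build would more likely use a
+maintained JWT library and read the secret from `SUPABASE_JWT_SECRET`.
+
+```python
+import base64
+import hashlib
+import hmac
+import json
+import time
+from typing import Optional
+
+
+def _b64url(raw: bytes) -> str:
+    """Base64url-encode without padding, as JWTs require."""
+    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
+
+
+def mint_realtime_jwt(
+    user_id: str,
+    secret: str,
+    ttl_seconds: int = 3600,
+    now: Optional[int] = None,
+) -> dict:
+    """Return {"token", "expires_at"} for a short-lived HS256 JWT (hypothetical helper)."""
+    issued_at = int(time.time()) if now is None else now
+    expires_at = issued_at + ttl_seconds
+    header = {"alg": "HS256", "typ": "JWT"}
+    # Claims taken from the design above: sub, role, aud, exp, iat.
+    claims = {
+        "sub": user_id,
+        "role": "authenticated",
+        "aud": "authenticated",
+        "exp": expires_at,
+        "iat": issued_at,
+    }
+    encoded_header = _b64url(json.dumps(header, separators=(",", ":")).encode("utf-8"))
+    encoded_claims = _b64url(json.dumps(claims, separators=(",", ":")).encode("utf-8"))
+    signing_input = f"{encoded_header}.{encoded_claims}"
+    # Assumption: the Supabase JWT secret is used as a raw UTF-8 string.
+    signature = hmac.new(
+        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
+    ).digest()
+    return {"token": f"{signing_input}.{_b64url(signature)}", "expires_at": expires_at}
+```
+
+The returned dict matches the `{ token, expires_at }` response shape described
+for the endpoint. If the project turns out to be on asymmetric signing keys, the
+HMAC step here would not apply.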
+
+### 2. RLS SELECT policy on room_messages (membership-scoped)
+```sql
+CREATE POLICY room_messages_member_read ON public.room_messages
+  FOR SELECT TO authenticated
+  USING (EXISTS (
+    SELECT 1 FROM public.room_members m
+    WHERE m.room_id = room_messages.room_id AND m.user_id = auth.uid()
+  ));
+```
+With the lockdown's RLS enabled and the JWT setting `auth.uid()`, an
+`authenticated` subscriber receives changes **only for rooms they belong to** —
+the client-side `room_id` filter stops being the only gate. (`authenticated`
+still holds the table GRANT SELECT, which the lockdown intentionally left.)
+
+### 3. Realtime delivery
+Two options, smallest first:
+- **(i) postgres_changes + the RLS policy above (recommended).** Realtime
+  evaluates the subscriber's RLS on each change, so the existing
+  `Social.tsx` subscription keeps working but is now membership-scoped. Minimal
+  client change: set the JWT (below). `room_messages` is already in the
+  `supabase_realtime` publication.
+- **(ii) Private channels (Realtime Authorization).** Mark the channel
+  `{ config: { private: true } }` and add a policy on `realtime.messages` for
+  the topic. More robust/explicit but a larger client rework. Defer unless (i)
+  proves insufficient.
+
+### 4. Client wiring (frontend)
+- After login, fetch the realtime token and apply it:
+  `getSupabase().realtime.setAuth(token)` (and pass it when (re)creating the
+  client). The client stops relying on the anon key for authorization.
+- **JWT refresh (the main complexity):** the session is **30 days** but the
+  Supabase JWT is **~1 hour**. Add a refresh loop — re-fetch the token shortly
+  before `expires_at` and call `setAuth` again — or the subscription drops when
+  the JWT expires. Handle: tab wake from sleep, network reconnect, and a failed
+  refresh (fall back to REST-only, which the #230 display fix already supports).
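+
+The refresh timing itself is simple arithmetic and independent of language. The
+sketch below shows it in Python for consistency with the backend examples,
+although the real loop would live in the frontend. `seconds_until_refresh` and
+the 5-minute lead are hypothetical choices, not values from this design.
+
+```python
+def seconds_until_refresh(
+    expires_at: float, now: float, lead_seconds: float = 300.0
+) -> float:
+    """How long to wait before re-fetching the realtime token.
+
+    Returns 0.0 when the refresh point has already passed, for example after
+    a tab wakes from sleep, so the caller refreshes immediately.
+    """
+    return max(0.0, expires_at - lead_seconds - now)
+```
+
+Recomputing the delay from the wall clock on every wake or reconnect, rather
+than trusting a timer set an hour earlier, is what covers the sleep case.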
+
+## Sequencing
+1. RLS lockdown (separate, urgent) — breaks anon realtime (accepted).
+2. This bridge — restores realtime for authenticated users, membership-scoped.
+
+## Effort estimate
+Moderate. Backend JWT-mint endpoint + env wiring (small); RLS SELECT policy
+(small); client setAuth (small); **JWT refresh loop + reconnect handling
+(the real work)**. Reuses the existing Supabase realtime architecture — no
+backend fan-out/broker, no client rebuild (contrast option (c)).
+
+## Optional follow-up
+If live reactions are wanted back (the dead `room_reactions` subscription is
+being removed), publish `room_reactions` to `supabase_realtime` and add the same
+membership-scoped SELECT policy. Until then, reactions update on
+load/refresh via REST.
diff --git a/docs/security/rls-lockdown-plan.md b/docs/security/rls-lockdown-plan.md
@@ -0,0 +1,97 @@
+# Project-wide RLS lockdown — apply & verification plan (#231)
+
+**Status: APPLIED to production 2026-06-13.** Anon is confirmed locked out
+(direct REST calls to `users`/`oauth_tokens`/`user_roles`/etc. now return
+`permission denied`, SQLSTATE 42501). This doc is the record of what was applied
+and how it was verified. The SQL scripts (`rls_lockdown.sql` apply,
+`rls_lockdown_rollback.sql` emergency revert) live in **PR #232** as the applied
+record — intentionally NOT merged to `main` (the change went straight to prod;
+nothing re-runs them from the repo).
+
+## Why this is safe for the backend
+The backend authenticates to Supabase with `SUPABASE_SERVICE_KEY` → the
+`service_role`, which has **`rolbypassrls = true`** (verified live: `SELECT
+rolname, rolbypassrls FROM pg_roles` → `service_role=t`, `anon=f`,
+`authenticated=f`). RLS is skipped for roles with `BYPASSRLS`, so **every backend
+query keeps working unchanged**. RLS only constrains `anon`/`authenticated`,
+which is exactly the public-anon-key path we're closing.
+
+## Expected breakage (accepted)
+Anon realtime on `room_messages` stops delivering once RLS is on / anon DML is
+revoked. This stays broken until the **option (a)** JWT bridge lands
+(`docs/security/realtime-jwt-bridge-design.md`). Per decision, the full-DB
+exposure outranks live chat updates. The #230 display fix already re-fetches via
+the (service-role) REST endpoint, so chat still works on load/refresh — only the
+live push is paused.
+
+## Test-first on a branch (if available)
+Supabase branching wasn't reachable via the MCP for this project (`list_branches`
+errored), so the project may be on a plan or permission level that doesn't expose
+branching. If you have branching:
+1. Create a dev branch in the dashboard.
+2. Run `rls_lockdown.sql` against the branch.
+3. Run the verification below pointed at the branch.
+4. Merge the branch (or apply the same SQL to prod) once green.
+
+If branching is unavailable: apply to prod during a low-traffic window with
+`rls_lockdown_rollback.sql` open and ready. The change is transactional
+(`BEGIN/COMMIT`) and fast (DDL only, no table rewrites).
+
+## Pre-apply snapshot (record for diffing)
+```sql
+SELECT count(*) FILTER (WHERE relrowsecurity) AS rls_on,
+       count(*) FILTER (WHERE NOT relrowsecurity) AS rls_off
+FROM pg_class WHERE relnamespace='public'::regnamespace AND relkind='r';
+-- expected before: rls_on=2, rls_off=38
+```
+
+## Apply
+Run `backend/db/security/rls_lockdown.sql` (the script from PR #232; it is not on
+`main`).
+
+## Post-apply verification checklist
+1. **RLS now on for all public tables:**
+   ```sql
+   SELECT count(*) FILTER (WHERE NOT relrowsecurity) AS still_off
+   FROM pg_class WHERE relnamespace='public'::regnamespace AND relkind='r';
+   -- expect: still_off = 0
+   ```
+2. **anon has no SELECT/INSERT/UPDATE/DELETE grants left:**
+   ```sql
+   SELECT count(*) AS anon_grants
+   FROM information_schema.role_table_grants
+   WHERE table_schema='public' AND grantee='anon'
+     AND privilege_type IN ('SELECT','INSERT','UPDATE','DELETE');
+   -- expect: anon_grants = 0
+   ```
+3. **anon is blocked at the REST endpoint** (the actual exposure): with the
+   public anon key,
+   ```
+   curl -s -o /dev/null -w "%{http_code}\n" \
+     "https://jxqcmjqtjlpuxfrxmrdv.supabase.co/rest/v1/users?select=id&limit=1" \
+     -H "apikey: <ANON_KEY>" -H "Authorization: Bearer <ANON_KEY>"
+   ```
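+   Expect **401** (permission denied, SQLSTATE 42501), never a row. The command
+   prints only the status code; a `200` needs a look at the body, since `[]` would
+   mean RLS is filtering while an anon grant remains. Repeat for `user_roles`,
+   `oauth_tokens`, `messages`.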
+4. **Backend still works (service_role):**
+   - `cd backend && python -m pytest tests/ -q` (suite is hermetic; sanity only).
+   - Hit live read + write endpoints against the target DB and confirm normal
+     behavior, e.g. `GET /api/auth/me` (read), a calendar/gradebook create
+     (write), a notes save. All should succeed exactly as before (service_role
+     bypasses RLS).
+5. **Realtime is paused (expected):** open a room — messages still load and
+   refresh via REST; live push is down until option (a). No errors beyond the
+   subscription returning nothing.
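+
+Check 3 can be scripted across several tables. The sketch below is one way to do
+that with the standard library. `classify_anon_response` and `fetch_as_anon` are
+hypothetical helpers, and it uses `select=*` rather than `select=id` on the
+assumption that not every table has an `id` column.
+
+```python
+import json
+import urllib.error
+import urllib.request
+from typing import Tuple
+
+
+def classify_anon_response(status: int, body: str) -> str:
+    """Map an anon REST response to a verdict (hypothetical helper)."""
+    if status in (401, 403):
+        return "blocked"  # permission denied: the desired outcome
+    if status == 200:
+        try:
+            rows = json.loads(body)
+        except json.JSONDecodeError:
+            return "unexpected"
+        # An empty list means no rows leaked, but a grant may still exist.
+        return "rls-filtered" if rows == [] else "EXPOSED"
+    return "unexpected"
+
+
+def fetch_as_anon(base_url: str, anon_key: str, table: str) -> Tuple[int, str]:
+    """GET one row from a table using the public anon key."""
+    req = urllib.request.Request(
+        f"{base_url}/rest/v1/{table}?select=*&limit=1",
+        headers={"apikey": anon_key, "Authorization": f"Bearer {anon_key}"},
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=10) as resp:
+            return resp.status, resp.read().decode("utf-8")
+    except urllib.error.HTTPError as err:
+        # Non-2xx responses arrive as exceptions; keep status and body.
+        return err.code, err.read().decode("utf-8")
+```
+
+Looping `fetch_as_anon` over `users`, `user_roles`, `oauth_tokens`, and
+`messages` and passing each result to `classify_anon_response` should yield
+`"blocked"` everywhere; anything else warrants a closer look.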
+
+## Rollback
+If something critical breaks: run
+`backend/db/security/rls_lockdown_rollback.sql` (re-grants anon, disables RLS on
+the 38). ⚠️ This restores the insecure state — re-apply the lockdown + option
+(a) as soon as the issue is understood.
+
+## Follow-ups (not in this script)
+- `authenticated` keeps its grants (RLS-with-no-policy denies it today); option
+  (a) adds membership-scoped policies for it on `room_messages`.
+- Storage hardening is a separate track (`docs/security/storage-hardening-plan.md`).
+- The 2 already-RLS tables (`achievement_cosmetics`, `achievement_triggers`)
+  have RLS on but **no policies** — confirm nothing legitimately reads them via
+  anon (the backend uses service_role, so it's unaffected).
diff --git a/docs/security/storage-hardening-plan.md b/docs/security/storage-hardening-plan.md
@@ -0,0 +1,54 @@
+# Storage hardening — PR-plan (#231)
+
+**Status: DRAFT plan for review. Nothing applied.**
+
+## Live findings (Sapling prod, read-only)
+Buckets that actually exist (3):
+
+| Bucket | `public` | Written by | Read by | Issue |
+|---|---|---|---|---|
+| `issues-media-files` (issue-report screenshots) | **true** | frontend **anon key** (`ReportIssueFlow.tsx`) | `getPublicUrl` (public) | anon upload + public read |
+| `application_resumes` (résumés) | **true** | backend service key (`careers.py`) | `getPublicUrl` (public) | **résumé PII publicly readable** |
+| `avatars` | true | backend service key (`storage_service.py`) | public `<img>` | intended public read |
+
+`storage.objects` has two policies: `"Allow public read"` (SELECT, `{public}`, `issues-media-files`) and **`"Allow uploads"` (INSERT, `{public}`, no bucket/auth restriction)**. The second lets anyone upload to **any** bucket, unauthenticated and unbounded (no size limit on `issues-media-files`/`application_resumes`).
+
+Note: `chat-images` and `cosmetic-assets` referenced in code **do not exist** — those upload paths are dead (separate cleanup; not a live exposure).
+
+## Target state
+All storage writes go through the **backend (service_role)**; private buckets are read via **backend-generated signed URLs**; only `avatars` stays public-read. After this, there are **no anon/public storage policies** — the anon storage surface is gone.
+
+| Bucket | public | upload path | read path |
+|---|---|---|---|
+| `issues-media-files` | **false** | new backend endpoint (multipart → service-key upload), reusing `request_limits.read_within_limit` + content-type allowlist (the #220/#229 pattern) | backend signed URL (admin view) |
+| `application_resumes` | **false** | already backend (`careers.py`) | backend signed URL (admin view) |
+| `avatars` | true | already backend | public (unchanged) |
+
+## Changes
+
+### SQL (review before applying)
+```sql
+BEGIN;
+UPDATE storage.buckets SET public = false WHERE id IN ('issues-media-files','application_resumes');
+DROP POLICY IF EXISTS "Allow uploads"     ON storage.objects;  -- kills the global public INSERT
+DROP POLICY IF EXISTS "Allow public read" ON storage.objects;  -- issues-media-files public read
+COMMIT;
+```
+No new storage.objects policies are needed: backend uploads/reads use `service_role` (bypasses storage RLS). `avatars` stays `public=true` so its objects remain readable without a policy.
+
+### Backend
+- New `POST /api/issue-reports/screenshot` (auth-gated via `get_session_user_id`): accepts the file, validates type+size with the shared `request_limits` helpers, uploads to `issues-media-files` with the service key (mirror `careers._upload_resume`), returns the storage path (not a public URL).
+- Signed-URL helper for private buckets (admin views of screenshots/résumés): backend issues a short-TTL signed URL via the storage REST API with the service key.
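+
+A rough sketch of the signed-URL helper, using only the standard library. The
+function names are hypothetical, and the endpoint shape (`POST
+/storage/v1/object/sign/<bucket>/<path>` with an `expiresIn` body, returning a
+relative `signedURL`) is an assumption to confirm against the Supabase Storage
+API at build time.
+
+```python
+import json
+import urllib.parse
+import urllib.request
+
+
+def build_sign_request(
+    supabase_url: str,
+    service_key: str,
+    bucket: str,
+    object_path: str,
+    expires_in: int = 60,
+) -> urllib.request.Request:
+    """Build (but do not send) the request that asks Storage for a signed URL."""
+    quoted_path = urllib.parse.quote(object_path)  # keeps "/" between segments
+    url = f"{supabase_url.rstrip('/')}/storage/v1/object/sign/{bucket}/{quoted_path}"
+    return urllib.request.Request(
+        url,
+        data=json.dumps({"expiresIn": expires_in}).encode("utf-8"),
+        headers={
+            "Authorization": f"Bearer {service_key}",
+            "Content-Type": "application/json",
+        },
+        method="POST",
+    )
+
+
+def absolute_signed_url(supabase_url: str, signed_path: str) -> str:
+    """Turn the relative path assumed to be in the response into a full URL."""
+    return f"{supabase_url.rstrip('/')}/storage/v1{signed_path}"
+```
+
+Keeping the TTL short (60 seconds here, an arbitrary choice) limits how long a
+leaked admin link to a résumé or screenshot stays usable.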
+
+### Frontend
+- `ReportIssueFlow.tsx`: stop using the anon `supabase.storage` client; POST the screenshot to the new backend endpoint. Removes a direct anon-key path (also shrinks the #231 surface).
+- Admin résumé/screenshot views: fetch signed URLs from the backend instead of assuming public URLs.
+
+## Verification
+- `storage.buckets`: `issues-media-files` and `application_resumes` show `public=false`; `avatars` stays `true`.
+- `pg_policies` (schema `storage`): the two `{public}` policies are gone.
+- Anon upload attempt → denied. Public URL to a private-bucket object → 400/403; signed URL → 200.
+- Issue-report flow and résumé upload still work end-to-end via the backend; avatars still render.
+
+## Priority
+`application_resumes` (résumé PII, publicly readable) is **equal priority** to the screenshots bucket — both flip to private first; the global public-INSERT policy is dropped in the same change.