custom domains - middleware and verification #2145

Soxasora · 2025-05-01T23:59:27Z

Description

Part of #1942
Focuses on custom domain insertion, providing a working form and verification process via pgboss.
Uses ACM with LocalStack to provide certificates CRUD simulation.

Screenshots

Adding a domain

ACM validation values

Domain is now ACTIVE

Custom domain view
TODO: UI/UX adjustments PR

Additional Context

#1958 is being split to facilitate review and build greater stuff on top of it.
Doesn't contain Auth Sync or custom branding.

This PR provides:

custom domains
-- context
-- crud, resolvers, fragments
-- form with basic UI/UX
-- ACM certificates support
-- local DNS server via dnsmasq support
middleware
-- custom domain detection via endpoint
-- doesn't apply referrals yet (no rewrites here)
domain verification
-- job every 5 minutes
-- AWS error handling
-- log every verification attempt and its steps
-- normalized schema for Domain, Record, Attempt, Certificate
triggers
-- on territory stop, HOLD the domain
-- on territory takeover, delete the domain
-- on domain/domainCertificate deletion, delete from ACM via pgboss job
workers
-- full domain verification
-- clear domains that have been on HOLD for more than 30 days
-- delete certificate from ACM (job)

Postponed TODOs:

give feedbacks on domain verification to the user now that we have DomainVerificationAttempt

Checklist

Are your changes backwards compatible? Please answer below:
Yes, everything is an addition. An exception is the way we handle S3 via localstack, now the container is called AWS and also allows ACM simulations other than S3.

On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:
7, QA OK, domain verification handles errors correctly and follows its step, middleware correctly detects a custom domain and obtains its connected sub.

For frontend changes: Tested on mobile, light and dark mode? Please answer below:
Yes, it uses standard SN colors and styles that fits with theme changes.

Did you introduce any new environment variables? If so, call them out explicitly here:
An endpoint for localstack that can be contacted by our worker container
LOCALSTACK_ENDPOINT=http://aws:4566

local DNS server via dnsmasq, with fallback to 1.1.1.1 in both dnsmasq and code.
DOMAINS_DNS_SERVER=172.20.0.2:53

- ACM support - custom domains crud, resolvers, fragments - custom domains form, guidelines - custom domains context - domain verification every 5 minutes via pgboss - domain validation schema - basic custom domains middleware, to be completed - TODOs tracings

- CustomDomain -> Domain - DomainVerification table - CNAME, TXT, SSL verification types - WIP DomainVerification upsert

…ange status of a Record from its Attempt, multi-purpose dns verification

…tch as they can fail

- use DomainVerificationStatus enum for domains and records - adapt Territory Form UI to new schema - return 'records' as an object with its types - wip: prepare for attempts and certificate usage for prisma

…tion triggers

fix: - fix setDomain mutation transaction - fix schema typedefs enhance: - DNS records guidelines with flex-wrap for longer records cleanup: - add comments to worker - remove console.log on validation values

…main

huumn · 2025-05-14T20:00:05Z

https://github.com/stackernews/stacker.news/blob/c7321351bfef74dc2cfa9d0c2542bb3fbbc5db2c/docs/dev/custom-domains-base.md

So to QA this I need to use real DNS? I suspect we'll want to mock DNS somehow, like we mock ACM etc, going forward to make this trivial to iterate on.

huumn

I did a very casual high level first pass to get acquainted. It looks very good.

Thank you for separating it into its own PR. Also thanks to @ekzyis for his first review which I'm sure led to lots of improvements.

I left some questions and nits. I'll make another high level pass tomorrow. Then maybe again depending on when we can test this in a self-contained way.

api/acm/index.js

api/resolvers/domain.js

pages/api/domains/index.js

prisma/migrations/20250504202129_custom_domains_base/migration.sql

huumn · 2025-05-15T00:42:30Z

worker/territory.js

+      // make sure to delete the certificate from ACM if the sub is stopped, if we have it.
+      if (nextStatus === 'STOPPED' && sub.domain?.certificate?.certificateArn) {
+        await deleteDomainCertificate(sub.domain.certificate.certificateArn)
+      }
+
      await models.sub.update({
        include: { user: true },
        where: {
          name: subName
        },
        data: {
-          status: nextBillingWithGrace(sub) >= new Date() ? 'GRACE' : 'STOPPED',
+          status: nextStatus,


hmmm hard to tell if this is the right behavior unless we automatically re-add the cert when it's paid.

Hmm, a new certificate from ACM is fast and free, what's not really good (for the territory owner) is that they will be subjected again to ACM DNS validation.

Maybe we don't need to remove the certificate at all as long as the domain remains the same but this is also somewhat connected to what you said on the previous comment:

I guess it's not clear to us whether the transfer is from an alt to another alt or to another person entirely.

…ain verification, move the interval to 1 hour

…n on updates not coming from HOLD

…ainName

huumn · 2025-05-22T23:03:37Z

I gave it another pass. I won't nit anymore. I think all we need now is the cert stuff.

…workers to pick the job concurrently; fix hour check for verification interval; cast Number from sn admin ids; validate domain name only when there's one

worker/domainVerification.js

… ELBv2 implementation for development; attach a certificate to ELB as last step of domainVerification; detach certificate from ELB as part of deleteDomainCertificate; DomainVerificationStage ELB_ATTACH_CERTIFICATE

huumn · 2025-05-23T18:09:51Z

Can you elaborate on the assumptions you make about production somewhere?

Specifically, I'm wondering how this will interact with our existing application load balancer rules. But there may be more assumptions, we need to verify/modify either in this code or in our infrastructure's configuration.

If I'm correct, this code only interacts with ACM and ELB (of which we use the application variant in prod). Is that correct?

Soxasora · 2025-05-23T21:21:59Z

The assumptions that I made in doing this were that
a) we only have a single load balancer
b) we have stacker.news pointing at that load balancer (listener).

In the code I specifically point at the first (and should be only) load balancer and on that ELB I point at the first HTTPS listener, because I excluded the possibility of having multiple load balancers or that we could have multiple listeners in various positions.

This code is made to be tweaked for the real AWS config. I think we can directly use the listener ARN and avoid useless round-trips.

I'm wondering how this will interact with our existing application load balancer rules

Back on the main question, the only thing that this new code does is attaching or detaching an ACM certificate to a listener[0] of a load balancer[0]. So it shouldn't interfere with the existing ELB configuration as we use only certificate ARNs.

Undoubtedly having stacker.news pointing directly at the ELB DNS is something that free us from extra hoops such as creating ad-hoc subdomains per each custom domain. If we were to have SN pointing at CloudFront and then ELB, we would need to create CloudFront subdomains and give them to the user to point their CNAME record at.

huumn · 2025-05-24T00:13:53Z

Cool! Thank you!

… functions that we don't need; consistency clean-up

…l errors, instead catch them and store a log

Soxasora · 2025-05-24T11:48:50Z

I think we can directly use the listener ARN and avoid useless round-trips.

I preferred to switch to this, we can avoid querying load balancers and its listeners and operate directly on the correct listener this way.

Also now AWS requests throw errors that triggers 3 retries with pgboss, after those 3 retries the domain gets put on HOLD. These errors are logged on DomainVerificationAttempt under the stage GENERAL

huumn · 2025-05-24T17:03:15Z

I need to confirm we can add as many certs to ALB as we want, and that one doesn't replace the other, and that they are served appropriately without us specifying special rules. In part this is a note to myself, but if you have links to docs or anything sox let me know.

My biggest fear with this PR is that shipping it breaks normal site operations because otherwise it's just a private feature.

Soxasora · 2025-05-24T17:59:33Z

I need to confirm we can add as many certs to ALB as we want

Ah, this is where it fails at scaling. ALBs have a cap of 25 additional ACM certificates and while a quota increase can be requested, it's not ideal and it can be expensive (there's not an easy way to figure out these costs without making a request it seems).

I did go on with this implementation for our internal testing period because it's compatible with the live AWS configuration but I had a different idea months ago, more on this later in the text.

that one (cert) doesn't replace the other

We can't replace a default certificate of a load balancer without specifically doing that via modifyListener. This code only adds or remove a certificate from the list of additional certificates.

that the (certs) are served appropriately without us specifying special rules

Something cool about the load balancer is that:

The load balancer uses a smart certificate selection algorithm with support for SNI. If the hostname provided by a client matches a single certificate in the certificate list, the load balancer selects this certificate.
^{SSL certificates for your Application Load Balancer}

So when we visit www.pizza.com pointed to stacker.news, the load balancer will select the correct certificate for the incoming hostname.

Everything is pretty cool but we need to move away from the actual configuration if we want to scale and allow more than 25 custom domains.

The CloudFront plan

This is my original proposal:
A better way to deal with custom domains is to use CloudFront distributions per custom domain and point those distributions to our ALB. We can create up to 500 distributions and it's a matter of creating a distribution right after domain verification, with the related ACM certificate attached to it.

We would then ask the user to point www.pizza.com to that CloudFront distribution which can either be 1234random21.cloudfront.net or masked with pizza.territory.stacker.news pointing to 1234random21.cloudfront.net

The alternate CloudFront plan

ACM certificates supports up to 10 domain names and a CloudFront distribution can only have a single ACM certificate.
We can choose to be 'smart' (but imo over-engineering) by batching custom domains in groups of 10 and add distributions once every 10 domains.
It adds complexity: we would need to keep track of ACM certificates and its attached domains, re-issue the certificate every time a new domain comes up, orchestrate CloudFront to issue a distribution when we reach the 10 domains cap.

I personally don't recommend this for now, but we can think about it.

this is not a thesis sox, tldr

For now we can use the live ALB and safely add/remove certificates without ever messing with the live configuration, but if we want to scale we need to move to CloudFront.

huumn · 2025-05-24T18:14:30Z

Also, are we sure ACM will let us generate certificates by just asking for one? I recall needing to perform DNS validation to get the cert approved.

https://docs.aws.amazon.com/acm/latest/userguide/dns-validation.html

Soxasora · 2025-05-24T18:21:08Z

worker/domainVerification.js

+    // STEP 2b: Get the validation values for the certificate
+    if (certificateArn && !recordMap.SSL) {
+      await getACMValidationValues(domain, models, certificateArn)
+
+      // return PENDING to check ACM validation later
+      return { status, message: 'Certificate issued and validation values stored.' }
+    }
+
+    // STEP 2c: Check ACM validation
+    const sslVerified = await checkACMValidation(domain, models, recordMap.SSL)
+    if (!sslVerified) return { status, message: 'ACM validation is still pending.' }


I recall needing to perform DNS validation to get the cert approved.

Yep! This is where we get the DNS validation values for the ACM certificate and on step 2c we poll ACM to check the result of the DNS validation, finally we update the certificate status.

Are we giving those validation values to the user? How are they being used?

I might've missed where we do this.

Ah yes, sorry, I was editing the first comment to explain how we present this to the user.

This is step 1 of domain guidelines on territory form:

After successful DNS verification we present the 2nd step:

edit: the domain form is kind of a mess right now and even if presentable to the user, it lacks errors and nice-to-have stuff that I want to include before going fully public.

Oh, I didn't know there were multiple steps!

In local dev, during my QA, I recall SSL was verified without me doing the second step. I could be misremembering though. hmmm

I could be misremembering though.

No! You're right, I got lucky grabbing this screenshot.
It polls Domain every 30 seconds in the territory form, and the domainVerification job also runs every 30 seconds, but in dev every step is done really quickly, so timing is a bitch¹

Footnotes

himym lol ↩

I'm confused is this a bug or how it's expected to work in local?

I guess because local ACM doesn't check for this CNAME record, it automatically issues it.

Also, this makes me wonder if the _snverify... TXT check could just be replaced with us checking for the ACM verification record - ie we reuse ACM verification that the domain is owned by the stacker, as our own verification that the stacker owns the domain?

... We wouldn't even need to rely on ACM's verification status ... we could check for the CNAME ourselves ...

I guess the one downside to this is that ACM is only used after the TXT record is verified.

I guess because local ACM doesn't check for this CNAME record, it automatically issues it.

Yeah, localstack doesn't simulate timings and stuff like that, so it's kind of a rough experience but it's the closest we can get to mock ACM. A nice thing is that the domainVerification job is prepared for this as it schedules a job after getting the validation values.

ie we reuse ACM verification that the domain is owned by the stacker, as our own verification that the stacker owns the domain

Cool! I definitely like this, less DNS records for the user ^^

we could check for the CNAME ourselves ...

Well, we still poll ACM to know if the certificate is 100% ISSUED (which means fully DNS validated) and be safe before trying to attach it to the load balancer, we can benefit from this by using it as part of domain ownership proof. If it's okay for you, I'd go with that!

Go for it. I think that'd be cleaner UX and I can't think of any reason to not do it.

Done. Passed first QA without any problems.
Some functions relied heavily on the possibility to have more than one DNS record, like DNS verification or the Territory Edit UI/UX. They've been adjusted for this and also reduced some complexity when it came to SSL verification.

Good call ^^

huumn · 2025-05-24T18:32:45Z

Ah, this is where it fails at scaling. ALBs have a cap of 25 additional ACM certificates and while a quota increase can be requested, it's not ideal and it can be expensive (there's not an easy way to figure out these costs without making a request it seems).

I think we are fine with 25 certs for now. We can program in a limit so that we don't exceed quotas if need be. Once we hit the quota we can request an increase. It's unclear what the maximum quota allowed is, but we should be fine for now.

… only CNAME (dns verification) or CNAME and SSL records (territory UI/UX)

huumn

I'm marking this as approved as discussed. I will QA one final time next week and we'll deploy it.

…ings

Soxasora added 2 commits May 1, 2025 18:38

Domains refactor, Domain Verification normalization

5e80c3f

- CustomDomain -> Domain - DomainVerification table - CNAME, TXT, SSL verification types - WIP DomainVerification upsert

Soxasora force-pushed the custom_domains_base branch from cd46aeb to 5e80c3f Compare May 2, 2025 23:33

Soxasora added 3 commits May 3, 2025 17:26

Domains normalization: Attempts, Records, Certificates

4d24845

Domain Verification worker adjusted to new schema; use triggers to ch…

b624c59

…ange status of a Record from its Attempt, multi-purpose dns verification

wip: Domain Verification worker, log all verification steps

89f1eb4

Soxasora closed this May 6, 2025

wip: clearer Domain Verification flow, surround AWS calls with try ca…

9d1c137

…tch as they can fail

Soxasora reopened this May 6, 2025

Soxasora added 9 commits May 6, 2025 14:51

Domain Verification schema updates

dc119a8

- use DomainVerificationStatus enum for domains and records - adapt Territory Form UI to new schema - return 'records' as an object with its types - wip: prepare for attempts and certificate usage for prisma

HOLD the domain and delete the certificate when a territory expires

67fb2c8

delete the certificate from ACM when we're about to STOP a territory

725ce81

Domain resolver refactor, use transactions, add comments

f3930f7

Stages for Domain Verification attempts logging, fix certificate dele…

01e319e

…tion triggers

separate ACM certificate requests and validation values

e132ad0

Domains UI/UX enhancements; core fixes to schema; general cleanup

cd9cb68

fix: - fix setDomain mutation transaction - fix schema typedefs enhance: - DNS records guidelines with flex-wrap for longer records cleanup: - add comments to worker - remove console.log on validation values

delete any existing domain verification jobs if we're updating the do…

9e96d7c

…main

Log AWS-related error messages; fix deleteCertificate recursion

82a71f5

Soxasora force-pushed the custom_domains_base branch from 775a966 to 82a71f5 Compare May 8, 2025 23:33

Soxasora and others added 2 commits May 8, 2025 18:47

fix missing await on async customDomainMiddleware

f95ab6a

Merge branch 'master' into custom_domains_base

4f49382

Soxasora marked this pull request as ready for review May 8, 2025 23:55

Soxasora and others added 2 commits May 8, 2025 19:04

hotfix: delete certificate from ACM also on domain removal

2382f3b

Merge branch 'master' into custom_domains_base

c732135

huumn reviewed May 15, 2025

View reviewed changes

Soxasora added 2 commits May 16, 2025 10:52

prepare for dnsmasq, light cleanup

ca13d80

fix DNS server typo

2a77fd1

Soxasora mentioned this pull request May 16, 2025

local DNS server via dnsmasq #2168

Merged

don't ask ACM to delete a certificate in a db transaction

072c1ae

Soxasora added 4 commits May 22, 2025 14:25

30 seconds of interval between verification jobs, after 1 hour of dom…

476c10b

…ain verification, move the interval to 1 hour

also get records when getting the existing domain, recreate the domai…

ef549a9

…n on updates not coming from HOLD

hotfix: use validateSchema the correct way, change from domain to dom…

1e3b1c6

…ainName

hide custom domains from the world but the admins

25674c2

debounce next verification jobs with a singleton key, avoiding other …

eddd453

…workers to pick the job concurrently; fix hour check for verification interval; cast Number from sn admin ids; validate domain name only when there's one

Soxasora force-pushed the custom_domains_base branch from dd22280 to eddd453 Compare May 22, 2025 23:13

Soxasora commented May 22, 2025

View reviewed changes

worker/domainVerification.js Outdated Show resolved Hide resolved

Soxasora added 3 commits May 24, 2025 05:13

use directly the interested ELB Listener ARN via env vars; get rid of…

f36eef4

… functions that we don't need; consistency clean-up

throw database and AWS-related errors; don't log the STAGE on critica…

51aadf2

…l errors, instead catch them and store a log

remove unused certificate attachment to ELB checks

5bf8aba

Soxasora commented May 24, 2025

View reviewed changes

Soxasora and others added 5 commits May 24, 2025 13:35

remove unused ELB env var, remove useless console.logs

1643c09

eradicate TXT records from custom domains; adjust functions to expect…

d579b55

… only CNAME (dns verification) or CNAME and SSL records (territory UI/UX)

Merge branch 'master' into custom_domains_base

102b2f2

pass CNAME record directly instead of the whole records map

093910b

don't delete the domain if resuming from HOLD

0ad0b33

huumn approved these changes May 26, 2025

View reviewed changes

Soxasora added 3 commits May 28, 2025 10:38

wip: custom domains documentation

ebf2c8b

docs: explain all the triggers, ACM and ALB implementations, fix head…

f361958

…ings

docs: clearer explanations

6a0b2cc

Uh oh!

custom domains - middleware and verification #2145

Are you sure you want to change the base?

custom domains - middleware and verification #2145

Uh oh!

Conversation

Soxasora commented May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Screenshots

Additional Context

Checklist

Uh oh!

huumn commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

huumn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

huumn May 15, 2025

Choose a reason for hiding this comment

Uh oh!

Soxasora May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

huumn commented May 22, 2025

Uh oh!

Uh oh!

huumn commented May 23, 2025

Uh oh!

Soxasora commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

huumn commented May 24, 2025

Uh oh!

Soxasora commented May 24, 2025

Uh oh!

huumn commented May 24, 2025

Uh oh!

Soxasora commented May 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The CloudFront plan

The alternate CloudFront plan

this is not a thesis sox, tldr

Uh oh!

huumn commented May 24, 2025

Uh oh!

Soxasora May 24, 2025

Choose a reason for hiding this comment

Uh oh!

huumn May 24, 2025

Choose a reason for hiding this comment

Uh oh!

Soxasora May 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

huumn May 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Soxasora May 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Footnotes

Uh oh!

huumn May 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Soxasora May 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

huumn May 25, 2025

Soxasora commented May 1, 2025 •

edited

Loading

huumn commented May 14, 2025 •

edited

Loading

Soxasora May 15, 2025 •

edited

Loading

Soxasora commented May 23, 2025 •

edited

Loading

Soxasora commented May 24, 2025 •

edited

Loading

Soxasora May 24, 2025 •

edited

Loading

huumn May 24, 2025 •

edited

Loading

Soxasora May 24, 2025 •

edited

Loading

huumn May 24, 2025 •

edited

Loading

Soxasora May 24, 2025 •

edited

Loading