-
-
Notifications
You must be signed in to change notification settings - Fork 130
custom domains - middleware and verification #2145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
- ACM support - custom domains crud, resolvers, fragments - custom domains form, guidelines - custom domains context - domain verification every 5 minutes via pgboss - domain validation schema - basic custom domains middleware, to be completed - TODOs tracings
- CustomDomain -> Domain - DomainVerification table - CNAME, TXT, SSL verification types - WIP DomainVerification upsert
cd46aeb
to
5e80c3f
Compare
…ange status of a Record from its Attempt, multi-purpose dns verification
…tch as they can fail
- use DomainVerificationStatus enum for domains and records - adapt Territory Form UI to new schema - return 'records' as an object with its types - wip: prepare for attempts and certificate usage for prisma
fix: - fix setDomain mutation transaction - fix schema typedefs enhance: - DNS records guidelines with flex-wrap for longer records cleanup: - add comments to worker - remove console.log on validation values
775a966
to
82a71f5
Compare
So to QA this I need to use real DNS? I suspect we'll want to mock DNS somehow, like we mock ACM etc, going forward to make this trivial to iterate on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a very casual high level first pass to get acquainted. It looks very good.
Thank you for separating it into its own PR. Also thanks to @ekzyis for his first review which I'm sure led to lots of improvements.
I left some questions and nits. I'll make another high level pass tomorrow. Then maybe again depending on when we can test this in a self-contained way.
prisma/migrations/20250504202129_custom_domains_base/migration.sql
Outdated
Show resolved
Hide resolved
worker/territory.js
Outdated
// make sure to delete the certificate from ACM if the sub is stopped, if we have it. | ||
if (nextStatus === 'STOPPED' && sub.domain?.certificate?.certificateArn) { | ||
await deleteDomainCertificate(sub.domain.certificate.certificateArn) | ||
} | ||
|
||
await models.sub.update({ | ||
include: { user: true }, | ||
where: { | ||
name: subName | ||
}, | ||
data: { | ||
status: nextBillingWithGrace(sub) >= new Date() ? 'GRACE' : 'STOPPED', | ||
status: nextStatus, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm hard to tell if this is the right behavior unless we automatically re-add the cert when it's paid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, a new certificate from ACM is fast and free, what's not really good (for the territory owner) is that they will be subjected again to ACM DNS validation.
Maybe we don't need to remove the certificate at all as long as the domain remains the same but this is also somewhat connected to what you said on the previous comment:
I guess it's not clear to us whether the transfer is from an alt to another alt or to another person entirely.
…ain verification, move the interval to 1 hour
…n on updates not coming from HOLD
I gave it another pass. I won't nit anymore. I think all we need now is the cert stuff. |
…workers to pick the job concurrently; fix hour check for verification interval; cast Number from sn admin ids; validate domain name only when there's one
dd22280
to
eddd453
Compare
… ELBv2 implementation for development; attach a certificate to ELB as last step of domainVerification; detach certificate from ELB as part of deleteDomainCertificate; DomainVerificationStage ELB_ATTACH_CERTIFICATE
Can you elaborate on the assumptions you make about production somewhere? Specifically, I'm wondering how this will interact with our existing application load balancer rules. But there may be more assumptions, we need to verify/modify either in this code or in our infrastructure's configuration. If I'm correct, this code only interacts with ACM and ELB (of which we use the application variant in prod). Is that correct? |
The assumptions that I made in doing this were that In the code I specifically point at the first (and should be only) load balancer and on that ELB I point at the first HTTPS listener, because I excluded the possibility of having multiple load balancers or that we could have multiple listeners in various positions. This code is made to be tweaked for the real AWS config. I think we can directly use the listener ARN and avoid useless round-trips.
Back on the main question, the only thing that this new code does is attaching or detaching an ACM certificate to a listener[0] of a load balancer[0]. So it shouldn't interfere with the existing ELB configuration as we use only certificate ARNs. Undoubtedly having stacker.news pointing directly at the ELB DNS is something that free us from extra hoops such as creating ad-hoc subdomains per each custom domain. If we were to have SN pointing at CloudFront and then ELB, we would need to create CloudFront subdomains and give them to the user to point their CNAME record at. |
Cool! Thank you! |
… functions that we don't need; consistency clean-up
…l errors, instead catch them and store a log
I preferred to switch to this, we can avoid querying load balancers and its listeners and operate directly on the correct listener this way. Also now AWS requests throw errors that triggers 3 retries with pgboss, after those 3 retries the domain gets put on |
I need to confirm we can add as many certs to ALB as we want, and that one doesn't replace the other, and that they are served appropriately without us specifying special rules. In part this is a note to myself, but if you have links to docs or anything sox let me know. My biggest fear with this PR is that shipping it breaks normal site operations because otherwise it's just a private feature. |
Ah, this is where it fails at scaling. ALBs have a cap of 25 additional ACM certificates and while a quota increase can be requested, it's not ideal and it can be expensive (there's not an easy way to figure out these costs without making a request it seems). I did go on with this implementation for our internal testing period because it's compatible with the live AWS configuration but I had a different idea months ago, more on this later in the text.
We can't replace a default certificate of a load balancer without specifically doing that via modifyListener. This code only adds or remove a certificate from the list of additional certificates.
Something cool about the load balancer is that:
So when we visit www.pizza.com pointed to stacker.news, the load balancer will select the correct certificate for the incoming hostname. Everything is pretty cool but we need to move away from the actual configuration if we want to scale and allow more than 25 custom domains. The CloudFront planThis is my original proposal: We would then ask the user to point The alternate CloudFront planACM certificates supports up to 10 domain names and a CloudFront distribution can only have a single ACM certificate. I personally don't recommend this for now, but we can think about it. this is not a thesis sox, tldrFor now we can use the live ALB and safely add/remove certificates without ever messing with the live configuration, but if we want to scale we need to move to CloudFront. |
Also, are we sure ACM will let us generate certificates by just asking for one? I recall needing to perform DNS validation to get the cert approved. https://docs.aws.amazon.com/acm/latest/userguide/dns-validation.html |
// STEP 2b: Get the validation values for the certificate | ||
if (certificateArn && !recordMap.SSL) { | ||
await getACMValidationValues(domain, models, certificateArn) | ||
|
||
// return PENDING to check ACM validation later | ||
return { status, message: 'Certificate issued and validation values stored.' } | ||
} | ||
|
||
// STEP 2c: Check ACM validation | ||
const sslVerified = await checkACMValidation(domain, models, recordMap.SSL) | ||
if (!sslVerified) return { status, message: 'ACM validation is still pending.' } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall needing to perform DNS validation to get the cert approved.
Yep! This is where we get the DNS validation values for the ACM certificate and on step 2c we poll ACM to check the result of the DNS validation, finally we update the certificate status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we giving those validation values to the user? How are they being used?
I might've missed where we do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, sorry, I was editing the first comment to explain how we present this to the user.
This is step 1 of domain guidelines on territory form:
After successful DNS verification we present the 2nd step:
edit: the domain form is kind of a mess right now and even if presentable to the user, it lacks errors and nice-to-have stuff that I want to include before going fully public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't know there were multiple steps!
In local dev, during my QA, I recall SSL was verified without me doing the second step. I could be misremembering though. hmmm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused is this a bug or how it's expected to work in local?
I guess because local ACM doesn't check for this CNAME record, it automatically issues it.
Also, this makes me wonder if the _snverify...
TXT check could just be replaced with us checking for the ACM verification record - ie we reuse ACM verification that the domain is owned by the stacker, as our own verification that the stacker owns the domain?
... We wouldn't even need to rely on ACM's verification status ... we could check for the CNAME ourselves ...
I guess the one downside to this is that ACM is only used after the TXT record is verified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess because local ACM doesn't check for this CNAME record, it automatically issues it.
Yeah, localstack doesn't simulate timings and stuff like that, so it's kind of a rough experience but it's the closest we can get to mock ACM. A nice thing is that the domainVerification job is prepared for this as it schedules a job after getting the validation values.
ie we reuse ACM verification that the domain is owned by the stacker, as our own verification that the stacker owns the domain
Cool! I definitely like this, less DNS records for the user ^^
we could check for the CNAME ourselves ...
Well, we still poll ACM to know if the certificate is 100% ISSUED
(which means fully DNS validated) and be safe before trying to attach it to the load balancer, we can benefit from this by using it as part of domain ownership proof. If it's okay for you, I'd go with that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go for it. I think that'd be cleaner UX and I can't think of any reason to not do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Passed first QA without any problems.
Some functions relied heavily on the possibility to have more than one DNS record, like DNS verification or the Territory Edit UI/UX. They've been adjusted for this and also reduced some complexity when it came to SSL verification.
Good call ^^
I think we are fine with 25 certs for now. We can program in a limit so that we don't exceed quotas if need be. Once we hit the quota we can request an increase. It's unclear what the maximum quota allowed is, but we should be fine for now. |
… only CNAME (dns verification) or CNAME and SSL records (territory UI/UX)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm marking this as approved as discussed. I will QA one final time next week and we'll deploy it.
Description
Part of #1942
Focuses on custom domain insertion, providing a working form and verification process via pgboss.
Uses ACM with LocalStack to provide certificates CRUD simulation.
Screenshots
Adding a domain

ACM validation values

Domain is now ACTIVE

Custom domain view

TODO: UI/UX adjustments PR
Additional Context
#1958 is being split to facilitate review and build greater stuff on top of it.
Doesn't contain Auth Sync or custom branding.
This PR provides:
custom domains
-- context
-- crud, resolvers, fragments
-- form with basic UI/UX
-- ACM certificates support
-- local DNS server via dnsmasq support
middleware
-- custom domain detection via endpoint
-- doesn't apply referrals yet (no rewrites here)
domain verification
-- job every 5 minutes
-- AWS error handling
-- log every verification attempt and its steps
-- normalized schema for Domain, Record, Attempt, Certificate
triggers
-- on territory stop, HOLD the domain
-- on territory takeover, delete the domain
-- on domain/domainCertificate deletion, delete from ACM via pgboss job
workers
-- full domain verification
-- clear domains that have been on HOLD for more than 30 days
-- delete certificate from ACM (job)
Postponed TODOs:
DomainVerificationAttempt
Checklist
Are your changes backwards compatible? Please answer below:
Yes, everything is an addition. An exception is the way we handle S3 via localstack, now the container is called AWS and also allows ACM simulations other than S3.
On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:
7, QA OK, domain verification handles errors correctly and follows its step, middleware correctly detects a custom domain and obtains its connected sub.
For frontend changes: Tested on mobile, light and dark mode? Please answer below:
Yes, it uses standard SN colors and styles that fits with theme changes.
Did you introduce any new environment variables? If so, call them out explicitly here:
An endpoint for localstack that can be contacted by our worker container
LOCALSTACK_ENDPOINT=http://aws:4566
local DNS server via dnsmasq, with fallback to 1.1.1.1 in both dnsmasq and code.
DOMAINS_DNS_SERVER=172.20.0.2:53