Skip to content

custom domains - middleware and verification #2145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 49 commits into
base: master
Choose a base branch
from

Conversation

Soxasora
Copy link
Member

@Soxasora Soxasora commented May 1, 2025

Description

Part of #1942
Focuses on custom domain insertion, providing a working form and verification process via pgboss.
Uses ACM with LocalStack to provide certificates CRUD simulation.

Screenshots

Adding a domain
image

ACM validation values
image

Domain is now ACTIVE
image

Custom domain view
TODO: UI/UX adjustments PR
image

Additional Context

#1958 is being split to facilitate review and build greater stuff on top of it.
Doesn't contain Auth Sync or custom branding.

This PR provides:

  • custom domains
    -- context
    -- crud, resolvers, fragments
    -- form with basic UI/UX
    -- ACM certificates support
    -- local DNS server via dnsmasq support

  • middleware
    -- custom domain detection via endpoint
    -- doesn't apply referrals yet (no rewrites here)

  • domain verification
    -- job every 5 minutes
    -- AWS error handling
    -- log every verification attempt and its steps
    -- normalized schema for Domain, Record, Attempt, Certificate

  • triggers
    -- on territory stop, HOLD the domain
    -- on territory takeover, delete the domain
    -- on domain/domainCertificate deletion, delete from ACM via pgboss job

  • workers
    -- full domain verification
    -- clear domains that have been on HOLD for more than 30 days
    -- delete certificate from ACM (job)

Postponed TODOs:

  • give feedbacks on domain verification to the user now that we have DomainVerificationAttempt

Checklist

Are your changes backwards compatible? Please answer below:
Yes, everything is an addition. An exception is the way we handle S3 via localstack, now the container is called AWS and also allows ACM simulations other than S3.

On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:
7, QA OK, domain verification handles errors correctly and follows its step, middleware correctly detects a custom domain and obtains its connected sub.

For frontend changes: Tested on mobile, light and dark mode? Please answer below:
Yes, it uses standard SN colors and styles that fits with theme changes.

Did you introduce any new environment variables? If so, call them out explicitly here:
An endpoint for localstack that can be contacted by our worker container
LOCALSTACK_ENDPOINT=http://aws:4566

local DNS server via dnsmasq, with fallback to 1.1.1.1 in both dnsmasq and code.
DOMAINS_DNS_SERVER=172.20.0.2:53

Soxasora added 2 commits May 1, 2025 18:38
- ACM support
- custom domains crud, resolvers, fragments
- custom domains form, guidelines
- custom domains context
- domain verification every 5 minutes via pgboss
- domain validation schema
- basic custom domains middleware, to be completed
- TODOs tracings
- CustomDomain -> Domain
- DomainVerification table
- CNAME, TXT, SSL verification types
- WIP DomainVerification upsert
@Soxasora Soxasora force-pushed the custom_domains_base branch from cd46aeb to 5e80c3f Compare May 2, 2025 23:33
@Soxasora Soxasora closed this May 6, 2025
@Soxasora Soxasora reopened this May 6, 2025
Soxasora added 9 commits May 6, 2025 14:51
- use DomainVerificationStatus enum for domains and records
- adapt Territory Form UI to new schema
- return 'records' as an object with its types
- wip: prepare for attempts and certificate usage for prisma
fix:
- fix setDomain mutation transaction
- fix schema typedefs

enhance:
- DNS records guidelines with flex-wrap for longer records

cleanup:
- add comments to worker
- remove console.log on validation values
@Soxasora Soxasora force-pushed the custom_domains_base branch from 775a966 to 82a71f5 Compare May 8, 2025 23:33
@Soxasora Soxasora marked this pull request as ready for review May 8, 2025 23:55
@huumn
Copy link
Member

huumn commented May 14, 2025

https://github.com/stackernews/stacker.news/blob/c7321351bfef74dc2cfa9d0c2542bb3fbbc5db2c/docs/dev/custom-domains-base.md

So to QA this I need to use real DNS? I suspect we'll want to mock DNS somehow, like we mock ACM etc, going forward to make this trivial to iterate on.

Copy link
Member

@huumn huumn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a very casual high level first pass to get acquainted. It looks very good.

Thank you for separating it into its own PR. Also thanks to @ekzyis for his first review which I'm sure led to lots of improvements.

I left some questions and nits. I'll make another high level pass tomorrow. Then maybe again depending on when we can test this in a self-contained way.

Comment on lines 19 to 30
// make sure to delete the certificate from ACM if the sub is stopped, if we have it.
if (nextStatus === 'STOPPED' && sub.domain?.certificate?.certificateArn) {
await deleteDomainCertificate(sub.domain.certificate.certificateArn)
}

await models.sub.update({
include: { user: true },
where: {
name: subName
},
data: {
status: nextBillingWithGrace(sub) >= new Date() ? 'GRACE' : 'STOPPED',
status: nextStatus,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm hard to tell if this is the right behavior unless we automatically re-add the cert when it's paid.

Copy link
Member Author

@Soxasora Soxasora May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, a new certificate from ACM is fast and free, what's not really good (for the territory owner) is that they will be subjected again to ACM DNS validation.

Maybe we don't need to remove the certificate at all as long as the domain remains the same but this is also somewhat connected to what you said on the previous comment:

I guess it's not clear to us whether the transfer is from an alt to another alt or to another person entirely.

@huumn
Copy link
Member

huumn commented May 22, 2025

I gave it another pass. I won't nit anymore. I think all we need now is the cert stuff.

…workers to pick the job concurrently; fix hour check for verification interval; cast Number from sn admin ids; validate domain name only when there's one
@Soxasora Soxasora force-pushed the custom_domains_base branch from dd22280 to eddd453 Compare May 22, 2025 23:13
… ELBv2 implementation for development; attach a certificate to ELB as last step of domainVerification; detach certificate from ELB as part of deleteDomainCertificate; DomainVerificationStage ELB_ATTACH_CERTIFICATE
@huumn
Copy link
Member

huumn commented May 23, 2025

Can you elaborate on the assumptions you make about production somewhere?

Specifically, I'm wondering how this will interact with our existing application load balancer rules. But there may be more assumptions, we need to verify/modify either in this code or in our infrastructure's configuration.

If I'm correct, this code only interacts with ACM and ELB (of which we use the application variant in prod). Is that correct?

@Soxasora
Copy link
Member Author

Soxasora commented May 23, 2025

The assumptions that I made in doing this were that
a) we only have a single load balancer
b) we have stacker.news pointing at that load balancer (listener).

In the code I specifically point at the first (and should be only) load balancer and on that ELB I point at the first HTTPS listener, because I excluded the possibility of having multiple load balancers or that we could have multiple listeners in various positions.

This code is made to be tweaked for the real AWS config. I think we can directly use the listener ARN and avoid useless round-trips.

I'm wondering how this will interact with our existing application load balancer rules

Back on the main question, the only thing that this new code does is attaching or detaching an ACM certificate to a listener[0] of a load balancer[0]. So it shouldn't interfere with the existing ELB configuration as we use only certificate ARNs.

Undoubtedly having stacker.news pointing directly at the ELB DNS is something that free us from extra hoops such as creating ad-hoc subdomains per each custom domain. If we were to have SN pointing at CloudFront and then ELB, we would need to create CloudFront subdomains and give them to the user to point their CNAME record at.

@huumn
Copy link
Member

huumn commented May 24, 2025

Cool! Thank you!

@Soxasora
Copy link
Member Author

I think we can directly use the listener ARN and avoid useless round-trips.

I preferred to switch to this, we can avoid querying load balancers and its listeners and operate directly on the correct listener this way.

Also now AWS requests throw errors that triggers 3 retries with pgboss, after those 3 retries the domain gets put on HOLD. These errors are logged on DomainVerificationAttempt under the stage GENERAL

@huumn
Copy link
Member

huumn commented May 24, 2025

I need to confirm we can add as many certs to ALB as we want, and that one doesn't replace the other, and that they are served appropriately without us specifying special rules. In part this is a note to myself, but if you have links to docs or anything sox let me know.

My biggest fear with this PR is that shipping it breaks normal site operations because otherwise it's just a private feature.

@Soxasora
Copy link
Member Author

Soxasora commented May 24, 2025

I need to confirm we can add as many certs to ALB as we want

Ah, this is where it fails at scaling. ALBs have a cap of 25 additional ACM certificates and while a quota increase can be requested, it's not ideal and it can be expensive (there's not an easy way to figure out these costs without making a request it seems).

I did go on with this implementation for our internal testing period because it's compatible with the live AWS configuration but I had a different idea months ago, more on this later in the text.

that one (cert) doesn't replace the other

We can't replace a default certificate of a load balancer without specifically doing that via modifyListener. This code only adds or remove a certificate from the list of additional certificates.

that the (certs) are served appropriately without us specifying special rules

Something cool about the load balancer is that:

The load balancer uses a smart certificate selection algorithm with support for SNI. If the hostname provided by a client matches a single certificate in the certificate list, the load balancer selects this certificate.
SSL certificates for your Application Load Balancer

So when we visit www.pizza.com pointed to stacker.news, the load balancer will select the correct certificate for the incoming hostname.

Everything is pretty cool but we need to move away from the actual configuration if we want to scale and allow more than 25 custom domains.

The CloudFront plan

This is my original proposal:
A better way to deal with custom domains is to use CloudFront distributions per custom domain and point those distributions to our ALB. We can create up to 500 distributions and it's a matter of creating a distribution right after domain verification, with the related ACM certificate attached to it.

We would then ask the user to point www.pizza.com to that CloudFront distribution which can either be 1234random21.cloudfront.net or masked with pizza.territory.stacker.news pointing to 1234random21.cloudfront.net

The alternate CloudFront plan

ACM certificates supports up to 10 domain names and a CloudFront distribution can only have a single ACM certificate.
We can choose to be 'smart' (but imo over-engineering) by batching custom domains in groups of 10 and add distributions once every 10 domains.
It adds complexity: we would need to keep track of ACM certificates and its attached domains, re-issue the certificate every time a new domain comes up, orchestrate CloudFront to issue a distribution when we reach the 10 domains cap.

I personally don't recommend this for now, but we can think about it.

this is not a thesis sox, tldr

For now we can use the live ALB and safely add/remove certificates without ever messing with the live configuration, but if we want to scale we need to move to CloudFront.

@huumn
Copy link
Member

huumn commented May 24, 2025

Also, are we sure ACM will let us generate certificates by just asking for one? I recall needing to perform DNS validation to get the cert approved.

https://docs.aws.amazon.com/acm/latest/userguide/dns-validation.html

Comment on lines +122 to +132
// STEP 2b: Get the validation values for the certificate
if (certificateArn && !recordMap.SSL) {
await getACMValidationValues(domain, models, certificateArn)

// return PENDING to check ACM validation later
return { status, message: 'Certificate issued and validation values stored.' }
}

// STEP 2c: Check ACM validation
const sslVerified = await checkACMValidation(domain, models, recordMap.SSL)
if (!sslVerified) return { status, message: 'ACM validation is still pending.' }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall needing to perform DNS validation to get the cert approved.

Yep! This is where we get the DNS validation values for the ACM certificate and on step 2c we poll ACM to check the result of the DNS validation, finally we update the certificate status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we giving those validation values to the user? How are they being used?

I might've missed where we do this.

Copy link
Member Author

@Soxasora Soxasora May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, sorry, I was editing the first comment to explain how we present this to the user.

This is step 1 of domain guidelines on territory form:
image

After successful DNS verification we present the 2nd step:
image

edit: the domain form is kind of a mess right now and even if presentable to the user, it lacks errors and nice-to-have stuff that I want to include before going fully public.

Copy link
Member

@huumn huumn May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't know there were multiple steps!

In local dev, during my QA, I recall SSL was verified without me doing the second step. I could be misremembering though. hmmm

Copy link
Member Author

@Soxasora Soxasora May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be misremembering though.

No! You're right, I got lucky grabbing this screenshot.
It polls Domain every 30 seconds in the territory form, and the domainVerification job also runs every 30 seconds, but in dev every step is done really quickly, so timing is a bitch1

Footnotes

  1. himym lol

Copy link
Member

@huumn huumn May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused is this a bug or how it's expected to work in local?

I guess because local ACM doesn't check for this CNAME record, it automatically issues it.


Also, this makes me wonder if the _snverify... TXT check could just be replaced with us checking for the ACM verification record - ie we reuse ACM verification that the domain is owned by the stacker, as our own verification that the stacker owns the domain?

... We wouldn't even need to rely on ACM's verification status ... we could check for the CNAME ourselves ...

I guess the one downside to this is that ACM is only used after the TXT record is verified.

Copy link
Member Author

@Soxasora Soxasora May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess because local ACM doesn't check for this CNAME record, it automatically issues it.

Yeah, localstack doesn't simulate timings and stuff like that, so it's kind of a rough experience but it's the closest we can get to mock ACM. A nice thing is that the domainVerification job is prepared for this as it schedules a job after getting the validation values.

ie we reuse ACM verification that the domain is owned by the stacker, as our own verification that the stacker owns the domain

Cool! I definitely like this, less DNS records for the user ^^

we could check for the CNAME ourselves ...

Well, we still poll ACM to know if the certificate is 100% ISSUED (which means fully DNS validated) and be safe before trying to attach it to the load balancer, we can benefit from this by using it as part of domain ownership proof. If it's okay for you, I'd go with that!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go for it. I think that'd be cleaner UX and I can't think of any reason to not do it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Passed first QA without any problems.
Some functions relied heavily on the possibility to have more than one DNS record, like DNS verification or the Territory Edit UI/UX. They've been adjusted for this and also reduced some complexity when it came to SSL verification.

Good call ^^

@huumn
Copy link
Member

huumn commented May 24, 2025

Ah, this is where it fails at scaling. ALBs have a cap of 25 additional ACM certificates and while a quota increase can be requested, it's not ideal and it can be expensive (there's not an easy way to figure out these costs without making a request it seems).

I think we are fine with 25 certs for now. We can program in a limit so that we don't exceed quotas if need be. Once we hit the quota we can request an increase. It's unclear what the maximum quota allowed is, but we should be fine for now.

Copy link
Member

@huumn huumn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm marking this as approved as discussed. I will QA one final time next week and we'll deploy it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants