tot20: Node 20 Deployment Architecture
Overview
Token of Trust migrated its Node.js monolith from Node 16 to Node 20. The migration introduced new tot20-app / tot20-sandbox users on the same servers. This document describes the before and after.
Before: Node 16 (tot users)
Each environment ran under a single Linux user per server:
| Environment | User | Node Version | Servers |
|---|---|---|---|
| Sandbox | tot-sandbox | 16.14.2 | sandbox1, sandbox3, sandbox4 |
| Production | tot-app | 16.14.2 | app1, app3, app4 |
The Ansible group vars pinned the Node version and user:
```yaml
# group_vars/tot-production.yaml
nodeDefaultVersion: "{{ nodeVersion16 }}"
totUser: "tot-app"
totService:
  node_version: "{{ nodeVersion16 }}"
```
The Ansible inventory listed all servers in one group:
```ini
[tot-production]
app3.tokenoftrust.com
app4.tokenoftrust.com

[tot-sandbox]
sandbox3.tokenoftrust.com
sandbox4.tokenoftrust.com
```
Deployments ran through a single Bitbucket pipeline (deploy-to-production) targeting the tot-production host group.
After: Node 20 (tot20 users)
The Node 20 configuration introduces a parallel Linux user on each server. The user owns its own nvm installation, PM2 daemon, and application directory, but shares the same EC2 instance, NGINX, and TLS certificates.
| Environment | User | Node Version | Servers |
|---|---|---|---|
| Sandbox | tot20-sandbox | 20.17.0 | sandbox1, sandbox3, sandbox4 |
| Production | tot20-app | 20.17.0 | app1, app3, app4 |
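For orientation, here is a minimal sketch of verifying the parallel user's toolchain on one server, assuming nvm and PM2 were already installed for the user by the provisioning playbooks (the exact provisioning steps live in Ansible, not here):

```bash
# Run from a sudo-capable account on an app server; tot20-app is the parallel user.
# Assumes nvm is installed in the user's home and pm2 was installed under its Node.
sudo -u tot20-app bash -ilc 'nvm install 20.17.0 && nvm alias default 20.17.0'
sudo -u tot20-app bash -ilc 'node -v && pm2 ls'   # this user's own Node and PM2 daemon
```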
The tot20 group vars change only the Node version and user:
```yaml
# group_vars/tot20-production.yaml
nodeDefaultVersion: "{{ nodeVersion20 }}"
totUser: "tot20-app"
totService:
  node_version: "{{ nodeVersion20 }}"
```
Everything else — service name (tot-app), port (31080), hostname (app.tokenoftrust.com), PM2 config structure, certbot domains — is identical. This means NGINX routing, config file paths, and operational tooling don't need to change.
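A quick, illustrative sanity check of that claim: NGINX only cares that something is listening on 127.0.0.1:31080, not which Linux user owns it.

```bash
# Show the process (and owning user) bound to the shared upstream port.
sudo ss -ltnp 'sport = :31080'
```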
The Ansible inventory now has separate groups:
```ini
[tot20-production]
app3.tokenoftrust.com
app4.tokenoftrust.com

[tot20-sandbox]
sandbox3.tokenoftrust.com
sandbox4.tokenoftrust.com
```
New Bitbucket pipelines (deploy-to-tot20-sandbox, deploy-to-tot20-production) target these groups. The pipeline normalizes environment names for downstream systems (Sentry, build config):
case "${DEPLOY_ENV,,}" in
sandbox20) DEPLOY_ENV="sandbox" ;;
production20) DEPLOY_ENV="production" ;;
esac
FUTURE: Rolling Migration
Although we did a hard cutover from Node 16 to Node 20 this time, in the future we would probably be better served by a rolling migration. In that scenario, servers migrate one at a time using DNS-weighted routing, so at least two servers are always handling traffic in each environment.
We used this approach when we rolled from 'forked' PM2 processes to 'cluster' configurations by introducing new 'sandbox1' and 'app1' instances. The first server in each environment (sandbox1, app1) was set up as the pioneer; the remaining servers followed the same pattern after being drained of traffic, migrated, and then re-introduced.
Here's what it looked like (note that some of this refers to Ansible tooling that only darrin has access to at the moment):
Migration states per environment
Sandbox — before rolling migration:

```
DNS: sandbox.tokenoftrust.com
  sandbox3  weight=100  (Node 16, tot-sandbox)
  sandbox4  weight=100  (Node 16, tot-sandbox)

Inventory:
  [tot20-sandbox]  sandbox1   <- tot20 deployed, not yet in DNS
  [tot-sandbox]    sandbox3, sandbox4
```

Sandbox — mid-migration (sandbox3 migrated):

```
DNS: sandbox.tokenoftrust.com
  sandbox1  weight=100  (Node 20, tot20-sandbox)
  sandbox3  weight=100  (Node 20, tot20-sandbox)
  sandbox4  weight=100  (Node 16, tot-sandbox)

Inventory:
  [tot20-sandbox]  sandbox1, sandbox3
  [tot-sandbox]    sandbox4
```

Sandbox — fully migrated:

```
DNS: sandbox.tokenoftrust.com
  sandbox1  weight=100  (Node 20, tot20-sandbox)
  sandbox3  weight=100  (Node 20, tot20-sandbox)
  sandbox4  weight=100  (Node 20, tot20-sandbox)

Inventory:
  [tot20-sandbox]  sandbox1, sandbox3, sandbox4
  [tot-sandbox]    (empty)
```
Production follows the same progression with app1/app3/app4 and app.tokenoftrust.com.
Per-server migration pattern
Each server moves through these states:
| Step | DNS | PM2 (old user) | PM2 (new user) |
|---|---|---|---|
| 1. Add leader | leader added, wt=100 | - | - |
| 2. Drain target | target set to wt=0 | running | - |
| 3. Wait | connections clearing | running | - |
| 4. Disable old | wt=0 | stopped+deleted | - |
| 5. Deploy new | wt=0 | - | running |
| 6. Restore | target set to wt=100 | - | running |
Step 1 (adding the leader) only happens once per environment — sandbox1 for sandbox, app1 for production. After that, the leader stays in the DNS pool for all subsequent migrations.
Between steps 2 and 6, the server being migrated is not receiving new traffic. The remaining servers absorb the load. The drain safety check requires at least 2 active servers (weight > 0) at all times.
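A minimal sketch of that guard, assuming drained hosts can be spotted as weight=0 lines in the list output (the real script's output format and behavior may differ):

```bash
# Hypothetical drain guard: refuse if fewer than 2 servers would stay active.
# Assumes `route53-manage.sh list` prints one record per line containing "weight=N".
active=$(./scripts/route53-manage.sh list sandbox.tokenoftrust.com | grep -c 'weight=[1-9]')
if [ "$active" -le 2 ]; then
  echo "Refusing to drain: only $active active server(s) in the pool (use --force to override)"
  exit 1
fi
```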
Tooling
Two scripts in servers/ansible/scripts/ support the migration:
route53-manage.sh — manages Route53 weighted A records:
```bash
./scripts/route53-manage.sh list sandbox.tokenoftrust.com
./scripts/route53-manage.sh add sandbox.tokenoftrust.com sandbox1
./scripts/route53-manage.sh drain sandbox.tokenoftrust.com sandbox3
./scripts/route53-manage.sh restore sandbox.tokenoftrust.com sandbox3
```
Accepts short aliases (app1), full hostnames (app1.tokenoftrust.com), or raw IPs. The short alias becomes the Route53 SetIdentifier. Drain sets weight to 0; restore sets it back to 100. Drain refuses if fewer than 2 active servers would remain (override with --force). Remove is also available to DELETE a record entirely.
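Under the hood a drain most likely reduces to a weighted UPSERT against Route53. A hand-rolled equivalent looks roughly like the following (illustrative only; the hosted-zone ID, TTL, and IP are placeholders that the script resolves for you):

```bash
# Rough equivalent of "drain sandbox3": set that weighted record's weight to 0.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "sandbox.tokenoftrust.com",
        "Type": "A",
        "SetIdentifier": "sandbox3",
        "Weight": 0,
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.13"}]
      }
    }]
  }'
```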
check-nginx-drain.sh — polls active nginx connections via Ansible:
```bash
./scripts/check-nginx-drain.sh sandbox3.tokenoftrust.com        # 180s timeout, 0 threshold
./scripts/check-nginx-drain.sh sandbox3.tokenoftrust.com 300 5  # 300s timeout, <=5 connections
```
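The mechanics are roughly as follows; this is a sketch that assumes an nginx stub_status endpoint at http://127.0.0.1/nginx_status on the target and an Ansible ad-hoc shell call, so the real script may differ in both respects:

```bash
#!/usr/bin/env bash
# Hypothetical drain poller: wait until active nginx connections on the host
# drop to or below a threshold, or give up after a timeout.
host="$1"; timeout="${2:-180}"; threshold="${3:-0}"
deadline=$(( $(date +%s) + timeout ))
while [ "$(date +%s)" -lt "$deadline" ]; do
  active=$(ansible -i tot "$host" -m shell \
    -a "curl -s http://127.0.0.1/nginx_status | awk '/Active connections/ {print \$3}'" \
    | tail -1)
  echo "active connections on $host: $active"
  [ "${active:-999}" -le "$threshold" ] && exit 0
  sleep 10
done
echo "Timed out waiting for connections to drain on $host" >&2
exit 1
```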
Ansible commands for each server
```bash
# Disable old PM2 process (dry-run first, then -e dry_run=false)
ansible-playbook -i tot -e hostGroup=tot-sandbox --ask-become-pass \
  --limit sandbox3.tokenoftrust.com playbooks/disable-pm2-services.yml

# Deploy tot20
ansible-playbook -i tot -e hostGroup=tot20-sandbox --ask-become-pass \
  --limit sandbox3.tokenoftrust.com playbooks/deployService.yml

# Verify
ansible -i tot sandbox3.tokenoftrust.com -b -a "sudo -u tot20-sandbox bash -ilc 'pm2 list'"
```
Substitute the server name, hostGroup, and user for each environment:
| | Sandbox | Production |
|---|---|---|
| DNS name | sandbox.tokenoftrust.com | app.tokenoftrust.com |
| Old group | tot-sandbox | tot-production |
| New group | tot20-sandbox | tot20-production |
| Old user | tot-sandbox | tot-app |
| New user | tot20-sandbox | tot20-app |
Inventory update
After each server migrates, move it from the old group to the new in the tot inventory file. The inventory is the source of truth for which servers have been migrated. Update it immediately after each successful deploy, before restoring DNS traffic.
```diff
 [tot20-sandbox]
 sandbox1.tokenoftrust.com
+sandbox3.tokenoftrust.com

 [tot-sandbox]
-sandbox3.tokenoftrust.com
+# sandbox3.tokenoftrust.com   # migrated to tot20-sandbox
 sandbox4.tokenoftrust.com
```
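After editing, you can confirm the groups resolve the way you expect:

```bash
# List which hosts Ansible now sees in each group.
ansible -i tot tot20-sandbox --list-hosts
ansible -i tot tot-sandbox --list-hosts
```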
Directory Layout Per User
Each user maintains independent blue/green deployment directories:
```
/home/tot20-app/services/tot-app/
  blue/           # Git checkout + built artifacts
  green/          # Git checkout + built artifacts
  app -> blue     # Symlink to active color
  tot-app.json    # PM2 ecosystem config
  config/         # Environment secrets, google-credentials.json, etc.
```
The setColor.sh script switches the symlink and restarts PM2. It reads the target color's .nvmrc to ensure the correct Node version is active, and detects whether the PM2 daemon's Node major matches the target — falling back to fork mode if there's a mismatch.
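A condensed sketch of that flow, assuming the PM2 daemon's PID is readable from ~/.pm2/pm2.pid and the paths match the layout above (the real setColor.sh does more than this):

```bash
#!/usr/bin/env bash
# Hypothetical core of setColor.sh: switch the symlink, activate the target
# color's Node version, and restart PM2.
set -euo pipefail
color="$1"                                   # "blue" or "green"
base="$HOME/services/tot-app"

ln -sfn "$base/$color" "$base/app"           # point the active-color symlink

source "$HOME/.nvm/nvm.sh"
nvm use "$(cat "$base/$color/.nvmrc")"       # honor the target color's Node version

# Compare the PM2 daemon's Node major with the target's Node major.
daemon_node="$(readlink -f "/proc/$(cat "$HOME/.pm2/pm2.pid")/exe" || true)"
daemon_major="$("$daemon_node" -v 2>/dev/null | sed 's/^v\([0-9]*\).*/\1/')"
target_major="$(node -v | sed 's/^v\([0-9]*\).*/\1/')"
if [ "$daemon_major" != "$target_major" ]; then
  echo "PM2 daemon Node v$daemon_major != target v$target_major;"
  echo "the real script falls back to fork mode here before restarting."
fi

pm2 startOrRestart "$base/tot-app.json"
```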
Application-Level Changes
The application code needed minor changes to support the numbered hosts that make up the server cluster:
- `config/*.json`: Added `additionalHostnames` arrays (e.g., `["app1.tokenoftrust.com", ...]`) so the CSRF referrer allowlist and login referrer checks accept requests from any host in the cluster.
- `requestUtils.js`: Reads `additionalHostnames` and populates the `useForward` and `loginReferrers` maps.
- `updateInstallBuild.sh`: Maps the `tot20-app` and `tot20-sandbox` users to the correct `BUILD_ENV` (see the sketch after this list).
- `package.json`: Adds `DISABLE_V8_COMPILE_CACHE=1` to the webpack scripts for Node 20 compatibility with older webpack versions.
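As an example of the user-to-environment mapping, a hypothetical excerpt in plain shell (the names and fallthrough are assumptions, not the actual script):

```bash
# Hypothetical excerpt from updateInstallBuild.sh: map the deploying user to
# BUILD_ENV so the tot20 users build the same targets as their tot counterparts.
case "$(whoami)" in
  tot-app|tot20-app)         BUILD_ENV="production" ;;
  tot-sandbox|tot20-sandbox) BUILD_ENV="sandbox" ;;
  *)                         BUILD_ENV="development" ;;
esac
export BUILD_ENV
```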
Provisioning Requirements
The tot20 users need the same credential and config files as the original tot users:
- SSH keys (`~/ssh-tmp/`): Required for `git clone` during the first deploy. The deploy playbook detects a missing repo and provides instructions to run the `init-git-checkout.yml` Ansible playbook.
- Google credentials (`../config/google-credentials.json`): Service account key for the Google Sheets API (used by the addons service).
- Any other secrets in the service config directory.
These are one-time setup tasks per server.
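One way to stage these for a new tot20 user (illustrative only; the source paths are placeholders, since the real secrets come from wherever you keep them):

```bash
# Hypothetical one-time provisioning for tot20-app on a new server.
sudo -u tot20-app mkdir -p /home/tot20-app/ssh-tmp \
                           /home/tot20-app/services/tot-app/config
sudo cp /root/secrets/tot-deploy-key          /home/tot20-app/ssh-tmp/
sudo cp /root/secrets/google-credentials.json /home/tot20-app/services/tot-app/config/
sudo chown -R tot20-app:tot20-app /home/tot20-app/ssh-tmp \
                                  /home/tot20-app/services/tot-app/config
```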
Why Separate Users
The Linux user is the natural isolation boundary. Each user has its own:
- nvm installation with a pinned Node version
- PM2 daemon (separate pid file, process list, log files)
- Application directory with independent blue/green checkouts
- Home directory for SSH keys, config, and credentials
NGINX proxies to a port — it doesn't care which user owns the upstream process. The application reads its config and serves requests — it doesn't care which Node version is running (after compatibility fixes). This separation means the old Node 16 users can remain on each server as an inert fallback, and can be fully retired once the migration is validated.
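A quick way to see that isolation in practice on a single server:

```bash
# Each user resolves its own Node and talks to its own PM2 daemon.
sudo -u tot-app   bash -ilc 'node -v && pm2 list'   # Node 16 stack (inert fallback)
sudo -u tot20-app bash -ilc 'node -v && pm2 list'   # Node 20 stack (live)
```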