tot20: Node 20 Deployment Architecture

Overview

Token of Trust migrated its Node.js monolith from Node 16 to Node 20. The migration introduced new tot20-app / tot20-sandbox users on the same servers. This document describes the before and after.

Before: Node 16 (tot users)

Each environment ran under a single Linux user per server:

Environment   User          Node Version   Servers
Sandbox       tot-sandbox   16.14.2        sandbox1, sandbox3, sandbox4
Production    tot-app       16.14.2        app1, app3, app4

The Ansible group vars pinned the Node version and user:

# group_vars/tot-production.yaml
nodeDefaultVersion: "{{ nodeVersion16 }}"
totUser: "tot-app"
totService:
    node_version: "{{ nodeVersion16 }}"

The Ansible inventory listed all servers in one group:

[tot-production]
app3.tokenoftrust.com
app4.tokenoftrust.com

[tot-sandbox]
sandbox3.tokenoftrust.com
sandbox4.tokenoftrust.com

Deployments ran through a single Bitbucket pipeline (deploy-to-production) targeting the tot-production host group.

After: Node 20 (tot20 users)

The Node 20 configuration introduces a parallel Linux user on each server. The user owns its own nvm installation, PM2 daemon, and application directory, but shares the same EC2 instance, NGINX, and TLS certificates.

Environment   User            Node Version   Servers
Sandbox       tot20-sandbox   20.17.0        sandbox1, sandbox3, sandbox4
Production    tot20-app       20.17.0        app1, app3, app4
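
Because both users coexist on the same servers, the split can be checked in place; as an illustrative sanity check (assuming the pinned versions above):

sudo -u tot-app bash -ilc 'node -v'      # expect v16.14.2
sudo -u tot20-app bash -ilc 'node -v'    # expect v20.17.0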

The tot20 group vars change only the Node version and user:

# group_vars/tot20-production.yaml
nodeDefaultVersion: "{{ nodeVersion20 }}"
totUser: "tot20-app"
totService:
    node_version: "{{ nodeVersion20 }}"

Everything else — service name (tot-app), port (31080), hostname (app.tokenoftrust.com), PM2 config structure, certbot domains — is identical. This means NGINX routing, config file paths, and operational tooling don't need to change.
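
For reference, NGINX keeps proxying to that same local port regardless of which user owns the upstream process. A simplified sketch of the relevant server block (the real config is managed by Ansible and may differ in detail):

# Simplified sketch: assumes the upstream listens on localhost; only the port
# matters to NGINX, not the owning Linux user.
server {
    server_name app.tokenoftrust.com;
    location / {
        proxy_pass http://127.0.0.1:31080;
    }
}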

The Ansible inventory now has separate groups:

[tot20-production]
app3.tokenoftrust.com
app4.tokenoftrust.com

[tot20-sandbox]
sandbox3.tokenoftrust.com
sandbox4.tokenoftrust.com

New Bitbucket pipelines (deploy-to-tot20-sandbox, deploy-to-tot20-production) target these groups. Each pipeline normalizes environment names for downstream systems (Sentry, build config):

case "${DEPLOY_ENV,,}" in
    sandbox20)   DEPLOY_ENV="sandbox" ;;
    production20) DEPLOY_ENV="production" ;;
esac

FUTURE: Rolling Migration

Although we performed a cutover from Node 16 to Node 20 this time, in the future we would probably be better served by a rolling migration. In that scenario, servers migrate one at a time using DNS-weighted routing, so at least two servers are always handling traffic in each environment.

We used this approach when we rolled from our 'forked' PM2 processes to 'cluster' configurations, introducing new 'sandbox1' and 'app1' instances to the mix. The first server in each environment (sandbox1, app1) was set up as the pioneer; the remaining servers followed the same pattern after being drained of traffic, migrated, and then re-introduced.

Here's what it looked like (note that some of this refers to Ansible tooling that only Darrin has access to at the moment):

Migration states per environment

Sandbox — before rolling migration:

DNS: sandbox.tokenoftrust.com
  sandbox3    weight=100    (Node 16, tot-sandbox)
  sandbox4    weight=100    (Node 16, tot-sandbox)

Inventory:
  [tot20-sandbox]  sandbox1              <- tot20 deployed, not yet in DNS
  [tot-sandbox]    sandbox3, sandbox4

Sandbox — mid-migration (sandbox3 migrated):

DNS: sandbox.tokenoftrust.com
  sandbox1    weight=100    (Node 20, tot20-sandbox)
  sandbox3    weight=100    (Node 20, tot20-sandbox)
  sandbox4    weight=100    (Node 16, tot-sandbox)

Inventory:
  [tot20-sandbox]  sandbox1, sandbox3
  [tot-sandbox]    sandbox4

Sandbox — fully migrated:

DNS: sandbox.tokenoftrust.com
  sandbox1    weight=100    (Node 20, tot20-sandbox)
  sandbox3    weight=100    (Node 20, tot20-sandbox)
  sandbox4    weight=100    (Node 20, tot20-sandbox)

Inventory:
  [tot20-sandbox]  sandbox1, sandbox3, sandbox4
  [tot-sandbox]    (empty)

Production follows the same progression with app1/app3/app4 and app.tokenoftrust.com.

Per-server migration pattern

Each server moves through these states:

                  DNS                     PM2 (old user)    PM2 (new user)
                  ---                     --------------    --------------
1. Add leader     leader added, wt=100    -                 -
2. Drain target   target set to wt=0      running           -
3. Wait           connections clearing    running           -
4. Disable old    wt=0                    stopped+deleted   -
5. Deploy new     wt=0                    -                 running
6. Restore        target set to wt=100    -                 running

Step 1 (adding the leader) only happens once per environment — sandbox1 for sandbox, app1 for production. After that, the leader stays in the DNS pool for all subsequent migrations.

Between steps 2 and 6, the server being migrated is not receiving new traffic. The remaining servers absorb the load. The drain safety check requires at least 2 active servers (weight > 0) at all times.

Tooling

Two scripts in servers/ansible/scripts/ support the migration:

route53-manage.sh — manages Route53 weighted A records:

./scripts/route53-manage.sh list    sandbox.tokenoftrust.com
./scripts/route53-manage.sh add     sandbox.tokenoftrust.com sandbox1
./scripts/route53-manage.sh drain   sandbox.tokenoftrust.com sandbox3
./scripts/route53-manage.sh restore sandbox.tokenoftrust.com sandbox3

The script accepts short aliases (app1), full hostnames (app1.tokenoftrust.com), or raw IPs; the short alias becomes the Route53 SetIdentifier. Drain sets a record's weight to 0; restore sets it back to 100. Drain refuses to proceed if fewer than 2 active servers would remain (override with --force). A remove action is also available to delete a record entirely.
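
Under the hood this is ordinary Route53 weighted-record manipulation. A hypothetical sketch of what a drain amounts to (hosted zone ID, TTL, and IP are placeholders):

# Hypothetical equivalent of "drain ... sandbox3": UPSERT the weighted A record with Weight=0.
aws route53 change-resource-record-sets --hosted-zone-id ZXXXXXXXXXXXXX --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "sandbox.tokenoftrust.com",
      "Type": "A",
      "SetIdentifier": "sandbox3",
      "Weight": 0,
      "TTL": 60,
      "ResourceRecords": [{"Value": "203.0.113.13"}]
    }
  }]
}'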

check-nginx-drain.sh — polls active nginx connections via Ansible:

./scripts/check-nginx-drain.sh sandbox3.tokenoftrust.com        # 180s timeout, 0 threshold
./scripts/check-nginx-drain.sh sandbox3.tokenoftrust.com 300 5   # 300s timeout, <=5 connections
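
The polling loop amounts to repeatedly asking nginx for its active connection count until it drops to the threshold. A rough sketch of the idea, assuming nginx exposes stub_status at http://127.0.0.1/nginx_status (the real script may differ):

HOST="$1"; TIMEOUT="${2:-180}"; THRESHOLD="${3:-0}"
deadline=$((SECONDS + TIMEOUT))
while (( SECONDS < deadline )); do
    # "Active connections: N" is the first line of nginx stub_status output.
    active=$(ansible -i tot "$HOST" -b -m shell \
        -a "curl -s http://127.0.0.1/nginx_status | awk '/Active/ {print \$3}'" | tail -1)
    echo "$HOST active connections: ${active:-unknown}"
    [ "${active:-999}" -le "$THRESHOLD" ] && exit 0
    sleep 10
done
echo "Timed out waiting for $HOST to drain" >&2
exit 1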

Ansible commands for each server

# Disable old PM2 process (dry-run first, then -e dry_run=false)
ansible-playbook -i tot -e hostGroup=tot-sandbox --ask-become-pass \
  --limit sandbox3.tokenoftrust.com playbooks/disable-pm2-services.yml

# Deploy tot20
ansible-playbook -i tot -e hostGroup=tot20-sandbox --ask-become-pass \
  --limit sandbox3.tokenoftrust.com playbooks/deployService.yml

# Verify
ansible -i tot sandbox3.tokenoftrust.com -b -a "sudo -u tot20-sandbox bash -ilc 'pm2 list'"

Substitute the server name, hostGroup, and user for each environment:

               Sandbox                    Production
DNS name       sandbox.tokenoftrust.com   app.tokenoftrust.com
Old group      tot-sandbox                tot-production
New group      tot20-sandbox              tot20-production
Old user       tot-sandbox                tot-app
New user       tot20-sandbox              tot20-app
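
Putting the pieces together, one server's migration (sandbox3 shown) follows roughly this sequence; treat it as a sketch of the workflow above rather than a turnkey script:

# 1. Drain sandbox3 out of DNS (the leader, sandbox1, must already be in the pool).
./scripts/route53-manage.sh drain sandbox.tokenoftrust.com sandbox3

# 2. Wait for in-flight nginx connections to clear.
./scripts/check-nginx-drain.sh sandbox3.tokenoftrust.com

# 3. Stop and delete the old Node 16 PM2 process (dry-run first).
ansible-playbook -i tot -e hostGroup=tot-sandbox -e dry_run=false --ask-become-pass \
  --limit sandbox3.tokenoftrust.com playbooks/disable-pm2-services.yml

# 4. Deploy the tot20 service.
ansible-playbook -i tot -e hostGroup=tot20-sandbox --ask-become-pass \
  --limit sandbox3.tokenoftrust.com playbooks/deployService.yml

# 5. Verify the new PM2 process, update the inventory (next section), then restore DNS weight.
ansible -i tot sandbox3.tokenoftrust.com -b -a "sudo -u tot20-sandbox bash -ilc 'pm2 list'"
./scripts/route53-manage.sh restore sandbox.tokenoftrust.com sandbox3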

Inventory update

After each server migrates, move it from the old group to the new in the tot inventory file. The inventory is the source of truth for which servers have been migrated. Update it immediately after each successful deploy, before restoring DNS traffic.

 [tot20-sandbox]
 sandbox1.tokenoftrust.com
+sandbox3.tokenoftrust.com

 [tot-sandbox]
-sandbox3.tokenoftrust.com
+# sandbox3.tokenoftrust.com  # migrated to tot20-sandbox
 sandbox4.tokenoftrust.com

Directory Layout Per User

Each user maintains independent blue/green deployment directories:

/home/tot20-app/services/tot-app/
    blue/              # Git checkout + built artifacts
    green/             # Git checkout + built artifacts
    app -> blue        # Symlink to active color
    tot-app.json       # PM2 ecosystem config
    config/            # Environment secrets, google-credentials.json, etc.

The setColor.sh script switches the symlink and restarts PM2. It reads the target color's .nvmrc to ensure the correct Node version is active, and detects whether the PM2 daemon's Node major matches the target — falling back to fork mode if there's a mismatch.
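
A simplified sketch of that switch logic, for orientation only (the real setColor.sh does more validation and handles the fork-mode fallback mentioned above):

# Sketch only: switch the active color and restart under that color's pinned Node.
cd /home/tot20-app/services/tot-app
TARGET_COLOR="$1"                          # "blue" or "green"
ln -sfn "$TARGET_COLOR" app                # repoint the active-color symlink

. "$HOME/.nvm/nvm.sh"
nvm use "$(cat "$TARGET_COLOR/.nvmrc")"    # activate the color's Node version

# If the PM2 daemon's Node major doesn't match, the real script falls back to fork mode.
pm2 startOrReload tot-app.json --update-env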

Application-Level Changes

The application code needed minor changes to support the numbered hosts that make up the server cluster:

  • config/*.json: Added additionalHostnames arrays (e.g., ["app1.tokenoftrust.com", ...]) so the CSRF referrer allowlist and login referrer checks accept requests from any host in the cluster (sketched after this list).
  • requestUtils.js: Reads additionalHostnames and populates the useForward and loginReferrers maps.
  • updateInstallBuild.sh: Maps tot20-app and tot20-sandbox users to the correct BUILD_ENV.
  • package.json: Adds DISABLE_V8_COMPILE_CACHE=1 to webpack scripts for Node 20 compatibility with older webpack versions.
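
For illustration, the additionalHostnames entry just enumerates every numbered host alongside the primary hostname; the exact key placement inside config/*.json may differ:

{
  "additionalHostnames": [
    "app1.tokenoftrust.com",
    "app3.tokenoftrust.com",
    "app4.tokenoftrust.com"
  ]
}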

Provisioning Requirements

The tot20 users need the same credential and config files as the original tot users:

  • SSH keys (~/ssh-tmp/): Required for git clone during first deploy. The deploy playbook detects a missing repo and provides instructions to run the init-git-checkout.yml Ansible playbook.
  • Google credentials (../config/google-credentials.json): Service account key for the Google Sheets API (used by the addons service).
  • Any other secrets in the service config directory.

These are one-time setup tasks per server.
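
A hypothetical way to seed the Google credentials using Ansible's copy module; the local source path is a placeholder, the destination assumes the sandbox user mirrors the per-user layout shown in the next section, and the same pattern applies to the SSH keys in ~/ssh-tmp/:

# Placeholder source path; destination and ownership assume the standard per-user layout.
ansible -i tot sandbox3.tokenoftrust.com -b -m copy \
  -a "src=./secrets/google-credentials.json dest=/home/tot20-sandbox/services/tot-app/config/google-credentials.json owner=tot20-sandbox group=tot20-sandbox mode=0600"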

Why Separate Users

The Linux user is the natural isolation boundary. Each user has its own:

  • nvm installation with a pinned Node version
  • PM2 daemon (separate pid file, process list, log files)
  • Application directory with independent blue/green checkouts
  • Home directory for SSH keys, config, and credentials

NGINX proxies to a port — it doesn't care which user owns the upstream process. The application reads its config and serves requests — it doesn't care which Node version is running (after compatibility fixes). This separation means the old Node 16 users can remain on each server as an inert fallback, and can be fully retired once the migration is validated.