A story about SSH tunnels, nginx reload, rotating instances, and why you should never hardcode anything. Ever.
TL;DR
I built a High Availability AI Gateway that serves an OpenAI-compatible API at a public HTTPS endpoint, with zero-downtime slot rotation, auto session recovery, multi-replica Cloudflare Tunnel ingress, and full observability piped into Langfuse. The stack: Cloudflare AI Gateway, Cloudflare Tunnel, Fly.io, nginx, and a Python asyncio orchestrator. No Kubernetes. No Helm charts. No mortgage on your soul.
This post is the story of how it happened: the good decisions, the bad decisions, and the one time I accidentally ran two workers simultaneously and spent 10 minutes wondering why auth was happening twice.
The Problem: A Great API Behind a Fragile Gate
Imagine you have access to a private AI inference service. Great models, fast responses, OpenAI-compatible API. One small catch: each instance session expires after ~60 minutes. Another small catch: instances are dynamic, and each one gets a fresh ephemeral SSH endpoint on every spin-up. One more small catch: the auth flow involves captcha solving, IMAP email polling, 2FA, and about 7 HTTP redirects.
You could call it directly. Once. Then your session expires and you're back to square one.
Or, and this is the path we took, you build a proper gateway around it and never think about session management again.
Architecture: The 30,000-Foot View
```
Internet
   │
   ▼
Cloudflare AI Gateway ─── Observability, OTEL, Langfuse
   │
   ▼
Cloudflare Tunnel (2 replicas, geo-aware failover)
   │
   ▼
Fly.io Machine · Singapore · nginx :8899 (least_conn, dynamic upstream)
   │
   ├── Python asyncio orchestrator
   │    ├── auth_loop()      · session keeper
   │    ├── pool_loop()      · instance lifecycle
   │    ├── cf_watchdog()    · tunnel health
   │    └── status_server()  · /health endpoint
   │
   ├── SSH local port forward :8901 → Instance A nginx :8899 → AI API
   └── SSH local port forward :8902 → Instance B nginx :8899 → AI API
```
Three layers. Each solves a distinct problem:
- Cloudflare layer: public ingress, DDoS protection, observability, zero infra management
- Fly.io orchestration layer: session management, pool rotation, nginx load balancing
- Instance layer: each dynamic AI instance running a local proxy that injects auth headers
Layer 1: Cloudflare Tunnel as the Stable Front Door
The first architectural challenge: how do you give a stable HTTPS URL to a service sitting on a Fly.io machine that could restart, migrate, or crash at 3am?
Answer: Cloudflare Tunnel.
A `cloudflared` process running on the Fly.io container creates a persistent outbound tunnel to Cloudflare's edge. No inbound ports. No firewall rules. No "wait, which IP is this again." Just:

```
cloudflared tunnel run --token $CF_TUNNEL_TOKEN
```
And your service magically appears at `https://your-domain.com`.

But we went further: two `cloudflared` replicas, both pointing at `localhost:8899`. Cloudflare's edge routes incoming requests to the geographically nearest healthy replica and fails over automatically if one dies. Our `cf_watchdog()` coroutine checks every 30 seconds and restarts any dead replica:

```python
async def cf_watchdog():
    while True:
        await asyncio.sleep(30)
        for i, proc in enumerate(_cf_procs):
            if proc and proc.poll() is not None:
                _cf_procs[i] = _start_cf_replica(i + 1, token)
```
It's a 10-line watchdog that has saved us from manual intervention more times than I'd like to admit.
Key insight: Cloudflare Tunnel gives you a stable ingress layer that survives container restarts, IP changes, and most "it worked yesterday" situations, all without exposing a single port to the internet.
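For completeness, here is a hedged sketch of the `_start_cf_replica()` helper referenced in the watchdog snippet above. The helper name comes from that snippet; the process-spawning details are assumptions.

```python
import subprocess

def cloudflared_cmd(token: str) -> list:
    # All replicas run with the same tunnel token, so Cloudflare's edge
    # sees N independent connections for one tunnel and balances across them.
    return ["cloudflared", "tunnel", "run", "--token", token]

def _start_cf_replica(replica_id: int, token: str) -> subprocess.Popen:
    # Fire-and-forget child process; cf_watchdog() polls it every 30 s
    # and calls this again if it has exited.
    return subprocess.Popen(
        cloudflared_cmd(token),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

Keeping the command construction separate from the `Popen` call makes the watchdog's restart path trivial to test.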
Layer 2: The Orchestrator β A Python asyncio Runtime
The Fly.io container runs a single Python process with four concurrent asyncio coroutines:
| Coroutine | Responsibility |
| --- | --- |
| `auth_loop()` | Keeps sessions fresh across all accounts, staggered to avoid rate limits |
| `pool_loop()` | Manages 2 active + 1 idle AI instances, rotates before expiry |
| `cf_watchdog()` | Restarts dead Cloudflare tunnel replicas |
| `status_server()` | Internal HTTP server on `:8800` for `/health` JSON |
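Launching these four loops is a one-liner in spirit. A minimal sketch of the entrypoint (the coroutine bodies are elided, and `supervise` is a hypothetical name):

```python
import asyncio

async def supervise(coros):
    # One event loop, no threads: if any long-lived coroutine raises,
    # gather propagates the exception, the process exits, and the
    # platform (Fly.io) restarts the whole machine.
    await asyncio.gather(*coros)

# In production, roughly:
# asyncio.run(supervise([auth_loop(), pool_loop(), cf_watchdog(), status_server()]))
```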
All four run under a single `asyncio.gather()`. No threads. No multiprocessing. No "it works until it doesn't" threading bugs.

Session Management: The Auth Loop
The private AI service requires authentication that produces a session valid for ~29 days, but instance slots only last ~60 minutes. So there are two separate TTLs to manage:
- Session TTL (~29 days): The auth loop runs periodically (every `floor(60/N_accounts)` minutes), re-authenticates before expiry, and persists sessions to `/data/sessions.json` on the Fly volume so they survive redeploys.
- Instance TTL (~60 min): The pool loop rotates instances proactively at 50 minutes, giving a 10-minute buffer before the hard expiry.
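The staggered auth loop described above can be sketched as follows. This is a hedged outline: `reauth` stands in for the real captcha/IMAP/2FA flow (and is assumed to persist the refreshed session itself), and `max_ticks` exists only so the sketch can be exercised in tests.

```python
import asyncio

async def auth_loop(accounts, reauth, interval_min=None, max_ticks=None):
    # Staggered refresh: with N accounts and a 60-minute budget, wake every
    # floor(60/N) minutes and re-auth one account per tick, round-robin,
    # so no two accounts hit the auth provider at the same moment.
    if interval_min is None:
        interval_min = max(1, 60 // len(accounts))
    tick = 0
    while max_ticks is None or tick < max_ticks:
        email = accounts[tick % len(accounts)]
        try:
            await reauth(email)
        except Exception as exc:
            print(f"[auth] {email} failed ({exc!r}); will retry next tick")
        tick += 1
        await asyncio.sleep(interval_min * 60)
```

With two accounts, each one is refreshed every 60 minutes even though the loop wakes every 30, which keeps sessions far inside their ~29-day TTL.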
The auth flow itself deserves its own section. It involves captcha solving via a third-party API, IMAP email polling for 2FA codes, multiple redirect chains, and cookie extraction. The trickiest bug? A `JSONDecodeError` from a response that had an empty body, which was caught by an outer try/except and silently aborted the entire 2FA flow before the verification email was ever sent. We spent 40 minutes staring at logs wondering why emails weren't arriving. They weren't arriving because we never asked for them.

Lesson: always scope your exception handlers as tightly as possible. An outer `except Exception` that catches everything is a mystery machine.

Pool Management: 2 Active + 1 Idle
The pool design follows a classic active-active-reserve pattern:
- Slot A and Slot B: Active instances serving traffic via nginx upstream
- Idle account: Pre-authenticated, waiting in reserve
Every 90 seconds, the health loop:
- Health-checks both active slots
- Checks slot age; if ≥ 50 minutes, triggers proactive rotation
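The 90-second health loop above can be sketched like this. It is a hedged outline: `age_of`, `healthy`, and `rotate` are hypothetical callables standing in for the real orchestrator methods, and `max_ticks` is only a test hook.

```python
import asyncio

ROTATE_AGE_S = 50 * 60  # rotate 10 minutes before the ~60-minute hard expiry

async def pool_loop(slot_ids, age_of, healthy, rotate,
                    tick_s=90, max_ticks=None):
    # Every tick, check each active slot and rotate any slot that is
    # either unhealthy or past the 50-minute mark.
    tick = 0
    while max_ticks is None or tick < max_ticks:
        for slot_id in list(slot_ids):
            if not healthy(slot_id):
                await rotate(slot_id, reason="health")
            elif age_of(slot_id) >= ROTATE_AGE_S:
                await rotate(slot_id, reason="age")
        tick += 1
        await asyncio.sleep(0 if max_ticks else tick_s)
```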
Rotation sequence:
- Spin up a new instance using the idle account
- Add it to nginx pool temporarily (test pool = 3 upstreams)
- Health check the new slot
- If healthy: promote it to replace the old slot, demote old account to idle, destroy old instance
- If unhealthy: rollback, keep old slot, retry next health tick
```python
async def _rotate_slot(slot_id: str, reason: str):
    async with _rotate_lock:  # prevents concurrent rotations
        # Try idle account first; if it fails (e.g. session invalid), fall back
        # to re-spinning the slot's own account (still has valid session), then
        # any other account with a cached session. Avoids the "death spiral"
        # where one broken idle account blocks rotation forever.
        for cand in [idle_email, slot_old_email, *spare_with_session]:
            try:
                new_slot = await spin_up_instance(cand, get_cached(cand))
                used_email = cand
                break
            except Exception:
                continue

        test_pool = list(_pool_slots.values()) + [new_slot]
        rewrite_nginx_pool(test_pool)  # nginx reload is non-disruptive
        await asyncio.sleep(3)

        if not health_check_slot(new_slot):
            rewrite_nginx_pool(list(_pool_slots.values()))  # rollback
            return

        # Promote
        _pool_slots[slot_id] = new_slot
        if used_email == idle_email:
            _pool_idle = old_email  # only swap idle if we used the idle
```
The rotation lock (`asyncio.Lock`) is critical: without it, simultaneous health failures on both slots could trigger concurrent rotations that race each other into a broken state.

Layer 3: The Dynamic Instance Problem - SSH Tunnels All the Way Down
Here's where it gets fun. Each AI instance, when created, gets a fresh ephemeral SSH endpoint: a randomly assigned hostname and port via Pinggy (a TCP tunneling service). There's no static IP. No DNS record. Just something like `xyz-8-222-166-3.run.pinggy-free.link:43723` that changes every rotation.

How do you put a dynamic SSH endpoint behind a static nginx upstream?

You don't. You create a local port forward instead.
```
ssh -p {pinggy_port} -N -f \
    -o StrictHostKeyChecking=no \
    -o ServerAliveInterval=30 \
    -L {local_port}:localhost:8899 root@{pinggy_host}
```
This creates a persistent background SSH process that tunnels `127.0.0.1:8901 → instance:8899`. Nginx just sees a local port. It never knows (or cares) that the actual instance is somewhere across the internet behind a Pinggy tunnel.

The nginx upstream config (dynamically rewritten on every rotation):

```nginx
upstream mimo_pool {
    least_conn;
    server 127.0.0.1:8901 max_fails=2 fail_timeout=10s;
    server 127.0.0.1:8902 max_fails=2 fail_timeout=10s;
    keepalive 16;
}
```
`nginx -s reload` is non-disruptive: in-flight requests continue to completion on the old config while new requests use the updated upstream list. Zero dropped connections during rotation.

On each instance, a shell script (`mimo-proxy.sh --mode local`) sets up a local nginx that:

- Listens on `:8899`
- Proxies to the actual AI API endpoint
- Injects the `api-key` header that the upstream requires

This keeps auth credentials on the instance side; the Fly.io orchestrator never needs to know or handle them.
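The dynamic upstream rewrite that rotation depends on can be sketched as below. The config path is an assumption (the real one depends on how the container's nginx.conf includes it); `render_pool` and `rewrite_nginx_pool` follow the names used in the rotation snippet.

```python
import subprocess
from pathlib import Path

# Assumed location of the generated upstream file.
UPSTREAM_CONF = Path("/etc/nginx/conf.d/mimo_pool.conf")

def render_pool(local_ports):
    # One server line per live SSH forward; least_conn sends each request
    # to whichever backend currently has the fewest active connections.
    servers = "\n".join(
        f"    server 127.0.0.1:{port} max_fails=2 fail_timeout=10s;"
        for port in local_ports
    )
    return (
        "upstream mimo_pool {\n"
        "    least_conn;\n"
        f"{servers}\n"
        "    keepalive 16;\n"
        "}\n"
    )

def rewrite_nginx_pool(local_ports):
    UPSTREAM_CONF.write_text(render_pool(local_ports))
    # Non-disruptive: old workers drain in-flight requests while new
    # workers serve from the rewritten upstream list.
    subprocess.run(["nginx", "-s", "reload"], check=True)
```

Because the file is regenerated from scratch on every rotation, the test pool (3 upstreams) and the rollback (2 upstreams) are just two calls with different port lists.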
State Persistence: Surviving Redeploys
Fly.io volumes persist across container restarts. We use `/data` to store:

- `/data/sessions.json`: all account sessions (so re-auth doesn't happen immediately after redeploy)
- `/data/pool_state.json`: active slot state with Pinggy host/port
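A hedged sketch of the boot-time restore path for that state file (field names and the 58-minute cutoff are taken from the description; everything else is assumed):

```python
import json
import subprocess
import time
from pathlib import Path

STATE_FILE = Path("/data/pool_state.json")
MAX_SLOT_AGE_S = 58 * 60  # older than this and the instance is about to expire

def slot_is_valid(slot, now=None):
    # A saved slot is only reusable if it still records its local port and
    # is young enough to be worth re-attaching to.
    now = time.time() if now is None else now
    return (bool(slot.get("local_port"))
            and (now - slot.get("created_at", 0)) < MAX_SLOT_AGE_S)

def restore_forward(slot):
    # Re-create the background `ssh -L` forward to the saved Pinggy endpoint;
    # raises if the endpoint is gone, which drops the slot below.
    subprocess.run(
        ["ssh", "-p", str(slot["pinggy_port"]), "-N", "-f",
         "-o", "StrictHostKeyChecking=no",
         "-o", "ServerAliveInterval=30",
         "-L", f"{slot['local_port']}:localhost:8899",
         f"root@{slot['pinggy_host']}"],
        check=True,
    )

def _load_pool_state():
    if not STATE_FILE.exists():
        return {}
    saved = json.loads(STATE_FILE.read_text())
    restored = {}
    for slot_id, slot in saved.items():
        if not slot_is_valid(slot):
            continue
        try:
            restore_forward(slot)
        except Exception:
            continue  # instance expired; the pool loop rotates in a fresh one
        restored[slot_id] = slot
    return restored
```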
On boot, `_load_pool_state()` validates saved slots (rejecting anything over 58 minutes old or missing `local_port`), then re-establishes SSH port forwards to the saved Pinggy endpoints. If any forward fails (because the instance expired), that slot is dropped and a fresh rotation is triggered.

Layer 4: Observability via Cloudflare AI Gateway
Once the gateway was stable, the next question was: who's calling it, what models, how many tokens, how fast?
The cleanest answer turned out to be the one we almost overthought.
Cloudflare AI Gateway has a Custom Providers feature β you give it any HTTPS base URL, and it becomes a first-class gateway provider with all the built-in observability features:
```
Dashboard → AI Gateway → Custom Providers → Add:
  - Slug: your-provider
  - Base URL: https://your-gateway-domain.com
```
Clients call:
```
https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/custom-{slug}/v1/chat/completions
```
Cloudflare intercepts the request, proxies it to your backend, and automatically captures:
- Request model & provider
- Input / output tokens
- Request prompts & completions
- Latency
- Cost estimates
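Assembling that URL is mechanical enough to capture in a tiny helper (a sketch, not an official SDK call; the placeholder names match the route above):

```python
def gateway_url(account_id: str, gateway_id: str, slug: str) -> str:
    # The custom-provider route just prefixes the OpenAI-compatible path the
    # backend already serves; the "custom-" marker selects a custom provider.
    return (
        "https://gateway.ai.cloudflare.com/v1/"
        f"{account_id}/{gateway_id}/custom-{slug}/v1/chat/completions"
    )
```

Any OpenAI-compatible client can then point at this endpoint instead of the backend directly, and Cloudflare records every request in between.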
For Langfuse integration: Gateway Settings → OpenTelemetry → Add exporter:

- URL: `https://cloud.langfuse.com/api/public/otel/v1/traces`
- Authorization: `Basic <base64(public_key:secret_key)>`
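If you ever need to generate that header value yourself (for example, to paste into the exporter form), it is plain HTTP Basic auth over the Langfuse key pair:

```python
import base64

def langfuse_basic_auth(public_key: str, secret_key: str) -> str:
    # Standard HTTP Basic auth: base64 of "public_key:secret_key",
    # prefixed with the "Basic " scheme.
    raw = f"{public_key}:{secret_key}".encode()
    return "Basic " + base64.b64encode(raw).decode()
```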
Zero code changes. Zero redeploy. Traces appear in Langfuse within seconds.
We briefly considered building a custom Python OTEL proxy (aiohttp middleware between nginx and the SSH forwards). Then we discovered CF AI Gateway Custom Providers and deleted all that code. The right abstraction was already there; we just hadn't looked for it.
Lesson: Before you write the middleware, check if the platform already has it. Usually it does.
Production Numbers
| Metric | Value |
| --- | --- |
| Active backends | 2 instances (always) |
| Instance TTL | 60 min (rotated at 50 min) |
| Session TTL | ~29 days |
| Health check interval | 90 seconds |
| CF tunnel replicas | 2 (auto-failover) |
| State persistence | Fly.io volume `/data` |
| Fly.io region | Singapore (`sin`) |
| VM spec | 1 shared CPU, 256MB RAM |
| Observability | CF AI Gateway + Langfuse OTEL |
The entire system runs on a single Fly.io machine with 256MB RAM. The Python orchestrator uses asyncio throughout: no threads, minimal memory footprint, and concurrent auth + pool + watchdog in one process.
Things That Went Wrong (A Highlight Reel)
- The Ghost Worker: While debugging via SSH, I accidentally launched a second worker process. Both ran simultaneously, both tried to auth the same accounts. Logs were... interesting. I killed it immediately, but not before it had already re-authed account 001 and logged a very confused "session already valid" message.
- The Silent 2FA Abort: A `JSONDecodeError` on an empty response body was caught by an outer `except Exception`, which returned "2fa_required" before the email-sending step was ever reached. No emails. No error. Just silence. It took 40 minutes to trace. Scope your exceptions.
- The Quote Escaping Labyrinth: Passing shell commands through SSH through Python f-strings through subprocess through the Fly console SSH. Single quotes inside double quotes inside f-strings inside bash inside SSH. The regex to validate the fix was longer than the fix itself.
- The "It's in the Secrets" Problem: Hardcoded credentials in a script that got committed to git. Rotation happened. Lesson learned. Now everything is `os.environ.get()` with a hard fail if missing.
- The Redundant Loop: The WebSocket connection dropped mid-conversation, and the retry logic attempted to reconnect with the same ticket from the same stale instance. New connection, old ticket, immediate rejection. Fix: treat WS errors as instance errors; destroy and recreate.
Key Design Principles (or: What I'd Tell Past Me)
- Separate concerns by TTL. Auth sessions (days), instance slots (hours), health checks (minutes), request handling (seconds): each needs its own management loop.
- Local port forwards are your friend. Dynamic external endpoints + static nginx upstream = SSH `-L` flag. Don't try to make nginx discover dynamic hosts. Make the dynamic hosts look like local ports.
- Persist everything to a volume. Container restarts are facts of life. Sessions, pool state, anything that takes time to rebuild: write it to disk on every update.
- Non-disruptive nginx reload. `nginx -s reload` is atomic. In-flight requests finish on the old config. New requests get the new upstream. Use it freely.
- Use the platform's built-in observability. Before writing a custom proxy for OTEL traces, check if your gateway/CDN already does it. Cloudflare AI Gateway Custom Providers + OTEL took 5 minutes to configure. The custom proxy took 3 hours to write and 30 seconds to delete.
- Two replicas, always. Whether it's Cloudflare Tunnel replicas or active pool slots: single points of failure in infrastructure are scheduled downtime.
The Stack Summary
| Component | Technology | Purpose |
| --- | --- | --- |
| Public ingress | Cloudflare AI Gateway | Observability, OTEL, rate limiting |
| Tunnel | Cloudflare Tunnel (2 replicas) | Stable HTTPS without exposed ports |
| Compute | Fly.io (Singapore) | Container hosting, volume storage |
| Load balancer | nginx (`least_conn`) | Internal traffic distribution |
| Orchestrator | Python asyncio | Auth + pool + watchdog management |
| Instance tunneling | Pinggy + SSH `-L` | Dynamic endpoint → static local port |
| Observability | Langfuse via OTLP | Traces, token usage, latency |
| State storage | Fly.io volume `/data` | Session + pool persistence across redeploys |
Closing Thoughts
The final architecture isn't particularly exotic. It's nginx, SSH, asyncio, and Cloudflare: tools that have existed for years. What made it interesting was the constraint: an ephemeral upstream with dynamic endpoints, short-lived sessions, and no static infrastructure to anchor to.
The Cloudflare ecosystem (Tunnel + AI Gateway) turned out to be exactly the right abstraction layer. Tunnel gives you a stable front door without a static IP. AI Gateway gives you observability without a custom proxy. Together, they handle the "how does traffic get in" and "what happened to that traffic" questions, leaving the application layer free to focus on the actual problem: keeping instances alive and sessions fresh.
If you're building AI infrastructure that needs to be resilient, observable, and not require you to wake up at 3am, the Cloudflare ecosystem is worth a serious look. It's not just a CDN anymore.
And if you're inheriting someone else's AI gateway that has hardcoded credentials, a one-liner outer `except Exception`, and a comment that says `# TODO: fix this` from 2 years ago, you now have a template for what to replace it with.

Built on Fly.io Singapore · Protected by Cloudflare Tunnel · Observed by Langfuse · Debugged at midnight
Tags:
cloudflare fly.io nginx python asyncio ai-gateway infrastructure ha opentelemetry langfuse ssh-tunneling devops