A story about SSH tunnels, nginx reload, rotating instances, and why you should never hardcode anything. Ever.
TL;DR
I built a High Availability AI Gateway that serves an OpenAI-compatible API at a public HTTPS endpoint, with zero-downtime slot rotation, auto session recovery, multi-replica Cloudflare Tunnel ingress, and full observability piped into Langfuse. The stack: Cloudflare AI Gateway, Cloudflare Tunnel, Fly.io, nginx, and a Python asyncio orchestrator. No Kubernetes. No Helm charts. No mortgage on your soul.
This post is the story of how it happened: the good decisions, the bad decisions, and the one time I accidentally ran two workers simultaneously and spent 10 minutes wondering why auth was happening twice.
The Problem: A Great API Behind a Fragile Gate
Imagine you have access to a private AI inference service. Great models, fast responses, OpenAI-compatible API. One small catch: each instance session expires after ~60 minutes. Another small catch: instances are dynamic, and each one gets a fresh ephemeral SSH endpoint on every spin-up. One more small catch: the auth flow involves captcha solving, IMAP email polling, 2FA, and about 7 HTTP redirects.
You could call it directly. Once. Then your session expires and you're back to square one.
Or, and this is the path we took, you build a proper gateway around it and never think about session management again.
Architecture: The 30,000-Foot View
```
Internet
   │
   ▼
Cloudflare AI Gateway ─── Observability, OTEL, Langfuse
   │
   ▼
Cloudflare Tunnel (2 replicas, geo-aware failover)
   │
   ▼
Fly.io Machine · Singapore · nginx :8899 (least_conn, dynamic upstream)
   │
   ├── Python asyncio orchestrator
   │    ├── auth_loop()      · session keeper
   │    ├── pool_loop()      · instance lifecycle
   │    ├── cf_watchdog()    · tunnel health
   │    └── status_server()  · /health endpoint
   │
   ├── SSH local port forward :8901 → Instance A nginx :8899 → AI API
   └── SSH local port forward :8902 → Instance B nginx :8899 → AI API
```
Three layers. Each solves a distinct problem:
- Cloudflare layer: public ingress, DDoS protection, observability, zero infra management
- Fly.io orchestration layer: session management, pool rotation, nginx load balancing
- Instance layer: each dynamic AI instance running a local proxy that injects auth headers
Layer 1: Cloudflare Tunnel as the Stable Front Door
The first architectural challenge: how do you give a stable HTTPS URL to a service sitting on a Fly.io machine that could restart, migrate, or crash at 3am?
Answer: Cloudflare Tunnel.
A `cloudflared` process running on the Fly.io container creates a persistent outbound tunnel to Cloudflare's edge. No inbound ports. No firewall rules. No "wait, which IP is this again." Just:

```
cloudflared tunnel run --token $CF_TUNNEL_TOKEN
```
And your service magically appears at `https://your-domain.com`.

But we went further: two `cloudflared` replicas, both pointing at `localhost:8899`. Cloudflare's edge routes incoming requests to the geographically nearest healthy replica and fails over automatically if one dies. Our `cf_watchdog()` coroutine checks every 30 seconds and restarts any dead replica:

```python
async def cf_watchdog():
    while True:
        await asyncio.sleep(30)
        for i, proc in enumerate(_cf_procs):
            if proc and proc.poll() is not None:
                _cf_procs[i] = _start_cf_replica(i + 1, token)
```
It's a 10-line watchdog that has saved us from manual intervention more times than I'd like to admit.
Key insight: Cloudflare Tunnel gives you a stable ingress layer that survives container restarts, IP changes, and most "it worked yesterday" situations, all without exposing a single port to the internet.
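For completeness, here is a hedged sketch of the `_start_cf_replica()` helper referenced in the watchdog snippet above. The helper name comes from that snippet; the process-spawning details are assumptions.

```python
import subprocess

def cloudflared_cmd(token: str) -> list:
    # All replicas run with the same tunnel token, so Cloudflare's edge
    # sees N independent connections for one tunnel and balances across them.
    return ["cloudflared", "tunnel", "run", "--token", token]

def _start_cf_replica(replica_id: int, token: str) -> subprocess.Popen:
    # Fire-and-forget child process; cf_watchdog() polls it every 30 s
    # and calls this again if it has exited.
    return subprocess.Popen(
        cloudflared_cmd(token),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

Keeping the command construction separate from the `Popen` call makes the watchdog's restart path trivial to test.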
Layer 2: The Orchestrator β A Python asyncio Runtime
The Fly.io container runs a single Python process with four concurrent asyncio coroutines:
| Coroutine | Responsibility |
| --- | --- |
| `auth_loop()` | Keeps sessions fresh across all accounts, staggered to avoid rate limits |
| `pool_loop()` | Manages 2 active + 1 idle AI instances, rotates before expiry |
| `cf_watchdog()` | Restarts dead Cloudflare tunnel replicas |
| `status_server()` | Internal HTTP server on `:8800` for `/health` JSON |
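Launching these four loops is a one-liner in spirit. A minimal sketch of the entrypoint (the coroutine bodies are elided, and `supervise` is a hypothetical name):

```python
import asyncio

async def supervise(coros):
    # One event loop, no threads: if any long-lived coroutine raises,
    # gather propagates the exception, the process exits, and the
    # platform (Fly.io) restarts the whole machine.
    await asyncio.gather(*coros)

# In production, roughly:
# asyncio.run(supervise([auth_loop(), pool_loop(), cf_watchdog(), status_server()]))
```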
All four run under a single `asyncio.gather()`. No threads. No multiprocessing. No "it works until it doesn't" threading bugs.

Session Management: The Auth Loop
The private AI service requires authentication that produces a session valid for ~29 days, but instance slots only last ~60 minutes. So there are two separate TTLs to manage:
- Session TTL (~29 days): The auth loop runs periodically (every `floor(60/N_accounts)` minutes), re-authenticates before expiry, and persists sessions to `/data/sessions.json` on the Fly volume so they survive redeploys.
- Instance TTL (~60 min): The pool loop rotates instances proactively at 50 minutes, giving a 10-minute buffer before the hard expiry.
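The staggered auth loop described above can be sketched as follows. This is a hedged outline: `reauth` stands in for the real captcha/IMAP/2FA flow (and is assumed to persist the refreshed session itself), and `max_ticks` exists only so the sketch can be exercised in tests.

```python
import asyncio

async def auth_loop(accounts, reauth, interval_min=None, max_ticks=None):
    # Staggered refresh: with N accounts and a 60-minute budget, wake every
    # floor(60/N) minutes and re-auth one account per tick, round-robin,
    # so no two accounts hit the auth provider at the same moment.
    if interval_min is None:
        interval_min = max(1, 60 // len(accounts))
    tick = 0
    while max_ticks is None or tick < max_ticks:
        email = accounts[tick % len(accounts)]
        try:
            await reauth(email)
        except Exception as exc:
            print(f"[auth] {email} failed ({exc!r}); will retry next tick")
        tick += 1
        await asyncio.sleep(interval_min * 60)
```

With two accounts, each one is refreshed every 60 minutes even though the loop wakes every 30, which keeps sessions far inside their ~29-day TTL.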
The auth flow itself deserves its own section. It involves captcha solving via a third-party API, IMAP email polling for 2FA codes, multiple redirect chains, and cookie extraction. The trickiest bug? A `JSONDecodeError` from a response that had an empty body, which was caught by an outer try/except and silently aborted the entire 2FA flow before the verification email was ever sent. We spent 40 minutes staring at logs wondering why emails weren't arriving. They weren't arriving because we never asked for them.

Lesson: always scope your exception handlers as tightly as possible. An outer `except Exception` that catches everything is a mystery machine.

Pool Management: 2 Active + 1 Idle
The pool design follows a classic active-active-reserve pattern:
- Slot A and Slot B: Active instances serving traffic via nginx upstream
- Idle account: Pre-authenticated, waiting in reserve
Every 90 seconds, the health loop:
- Health-checks both active slots
- Checks slot age; if ≥ 50 minutes, triggers proactive rotation
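The 90-second health loop above can be sketched like this. It is a hedged outline: `age_of`, `healthy`, and `rotate` are hypothetical callables standing in for the real orchestrator methods, and `max_ticks` is only a test hook.

```python
import asyncio

ROTATE_AGE_S = 50 * 60  # rotate 10 minutes before the ~60-minute hard expiry

async def pool_loop(slot_ids, age_of, healthy, rotate,
                    tick_s=90, max_ticks=None):
    # Every tick, check each active slot and rotate any slot that is
    # either unhealthy or past the 50-minute mark.
    tick = 0
    while max_ticks is None or tick < max_ticks:
        for slot_id in list(slot_ids):
            if not healthy(slot_id):
                await rotate(slot_id, reason="health")
            elif age_of(slot_id) >= ROTATE_AGE_S:
                await rotate(slot_id, reason="age")
        tick += 1
        await asyncio.sleep(0 if max_ticks else tick_s)
```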
Rotation sequence:
- Spin up a new instance using the idle account
- Add it to nginx pool temporarily (test pool = 3 upstreams)
- Health check the new slot
- If healthy: promote it to replace the old slot, demote old account to idle, destroy old instance
- If unhealthy: rollback, keep old slot, retry next health tick
```python
async def _rotate_slot(slot_id: str, reason: str):
    async with _rotate_lock:  # prevents concurrent rotations
        # Try idle account first; if it fails (e.g. session invalid), fall back
        # to re-spinning the slot's own account (still has valid session), then
        # any other account with a cached session. Avoids the "death spiral"
        # where one broken idle account blocks rotation forever.
        for cand in [idle_email, slot_old_email, *spare_with_session]:
            try:
                new_slot = await spin_up_instance(cand, get_cached(cand))
                used_email = cand
                break
            except Exception:
                continue

        test_pool = list(_pool_slots.values()) + [new_slot]
        rewrite_nginx_pool(test_pool)  # nginx reload is non-disruptive
        await asyncio.sleep(3)

        if not health_check_slot(new_slot):
            rewrite_nginx_pool(list(_pool_slots.values()))  # rollback
            return

        # Promote
        _pool_slots[slot_id] = new_slot
        if used_email == idle_email:
            _pool_idle = old_email  # only swap idle if we used the idle
```
The rotation lock (`asyncio.Lock`) is critical: without it, simultaneous health failures on both slots could trigger concurrent rotations that race each other into a broken state.

Layer 3: The Dynamic Instance Problem - SSH Tunnels All the Way Down
Here's where it gets fun. Each AI instance, when created, gets a fresh ephemeral SSH endpoint: a randomly assigned hostname and port via Pinggy (a TCP tunneling service). There's no static IP. No DNS record. Just something like `xyz-8-222-166-3.run.pinggy-free.link:43723` that changes every rotation.

How do you put a dynamic SSH endpoint behind a static nginx upstream?

You don't. You create a local port forward instead.
```
ssh -p {pinggy_port} -N -f \
    -o StrictHostKeyChecking=no \
    -o ServerAliveInterval=30 \
    -L {local_port}:localhost:8899 root@{pinggy_host}
```
This creates a persistent background SSH process that tunnels `127.0.0.1:8901 → instance:8899`. Nginx just sees a local port. It never knows (or cares) that the actual instance is somewhere across the internet behind a Pinggy tunnel.

The nginx upstream config (dynamically rewritten on every rotation):

```nginx
upstream mimo_pool {
    least_conn;
    server 127.0.0.1:8901 max_fails=2 fail_timeout=10s;
    server 127.0.0.1:8902 max_fails=2 fail_timeout=10s;
    keepalive 16;
}
```
`nginx -s reload` is non-disruptive: in-flight requests continue to completion on the old config while new requests use the updated upstream list. Zero dropped connections during rotation.

On each instance, a shell script (`mimo-proxy.sh --mode local`) sets up a local nginx that:

- Listens on `:8899`
- Proxies to the actual AI API endpoint
- Injects the `api-key` header that the upstream requires

This keeps auth credentials on the instance side; the Fly.io orchestrator never needs to know or handle them.
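The dynamic upstream rewrite that rotation depends on can be sketched as below. The config path is an assumption (the real one depends on how the container's nginx.conf includes it); `render_pool` and `rewrite_nginx_pool` follow the names used in the rotation snippet.

```python
import subprocess
from pathlib import Path

# Assumed location of the generated upstream file.
UPSTREAM_CONF = Path("/etc/nginx/conf.d/mimo_pool.conf")

def render_pool(local_ports):
    # One server line per live SSH forward; least_conn sends each request
    # to whichever backend currently has the fewest active connections.
    servers = "\n".join(
        f"    server 127.0.0.1:{port} max_fails=2 fail_timeout=10s;"
        for port in local_ports
    )
    return (
        "upstream mimo_pool {\n"
        "    least_conn;\n"
        f"{servers}\n"
        "    keepalive 16;\n"
        "}\n"
    )

def rewrite_nginx_pool(local_ports):
    UPSTREAM_CONF.write_text(render_pool(local_ports))
    # Non-disruptive: old workers drain in-flight requests while new
    # workers serve from the rewritten upstream list.
    subprocess.run(["nginx", "-s", "reload"], check=True)
```

Because the file is regenerated from scratch on every rotation, the test pool (3 upstreams) and the rollback (2 upstreams) are just two calls with different port lists.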
State Persistence: Surviving Redeploys
Fly.io volumes persist across container restarts. We use `/data` to store:

- `/data/sessions.json`: all account sessions (so re-auth doesn't happen immediately after redeploy)
- `/data/pool_state.json`: active slot state with Pinggy host/port
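A hedged sketch of the boot-time restore path for that state file (field names and the 58-minute cutoff are taken from the description; everything else is assumed):

```python
import json
import subprocess
import time
from pathlib import Path

STATE_FILE = Path("/data/pool_state.json")
MAX_SLOT_AGE_S = 58 * 60  # older than this and the instance is about to expire

def slot_is_valid(slot, now=None):
    # A saved slot is only reusable if it still records its local port and
    # is young enough to be worth re-attaching to.
    now = time.time() if now is None else now
    return (bool(slot.get("local_port"))
            and (now - slot.get("created_at", 0)) < MAX_SLOT_AGE_S)

def restore_forward(slot):
    # Re-create the background `ssh -L` forward to the saved Pinggy endpoint;
    # raises if the endpoint is gone, which drops the slot below.
    subprocess.run(
        ["ssh", "-p", str(slot["pinggy_port"]), "-N", "-f",
         "-o", "StrictHostKeyChecking=no",
         "-o", "ServerAliveInterval=30",
         "-L", f"{slot['local_port']}:localhost:8899",
         f"root@{slot['pinggy_host']}"],
        check=True,
    )

def _load_pool_state():
    if not STATE_FILE.exists():
        return {}
    saved = json.loads(STATE_FILE.read_text())
    restored = {}
    for slot_id, slot in saved.items():
        if not slot_is_valid(slot):
            continue
        try:
            restore_forward(slot)
        except Exception:
            continue  # instance expired; the pool loop rotates in a fresh one
        restored[slot_id] = slot
    return restored
```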
On boot, `_load_pool_state()` validates saved slots (rejecting anything over 58 minutes old or missing `local_port`), then re-establishes SSH port forwards to the saved Pinggy endpoints. If any forward fails (because the instance expired), that slot is dropped and a fresh rotation is triggered.

Layer 4: Observability via Cloudflare AI Gateway
Once the gateway was stable, the next question was: who's calling it, what models, how many tokens, how fast?
The cleanest answer turned out to be the one we almost overthought.
Cloudflare AI Gateway has a Custom Providers feature β you give it any HTTPS base URL, and it becomes a first-class gateway provider with all the built-in observability features:
```
Dashboard → AI Gateway → Custom Providers → Add:
  - Slug: your-provider
  - Base URL: https://your-gateway-domain.com
```
Clients call:
```
https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/custom-{slug}/v1/chat/completions
```
Cloudflare intercepts the request, proxies it to your backend, and automatically captures:
- Request model & provider
- Input / output tokens
- Request prompts & completions
- Latency
- Cost estimates
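Assembling that URL is mechanical enough to capture in a tiny helper (a sketch, not an official SDK call; the placeholder names match the route above):

```python
def gateway_url(account_id: str, gateway_id: str, slug: str) -> str:
    # The custom-provider route just prefixes the OpenAI-compatible path the
    # backend already serves; the "custom-" marker selects a custom provider.
    return (
        "https://gateway.ai.cloudflare.com/v1/"
        f"{account_id}/{gateway_id}/custom-{slug}/v1/chat/completions"
    )
```

Any OpenAI-compatible client can then point at this endpoint instead of the backend directly, and Cloudflare records every request in between.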
For Langfuse integration: Gateway Settings → OpenTelemetry → Add exporter:

- URL: `https://cloud.langfuse.com/api/public/otel/v1/traces`
- Authorization: `Basic <base64(public_key:secret_key)>`
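If you ever need to generate that header value yourself (for example, to paste into the exporter form), it is plain HTTP Basic auth over the Langfuse key pair:

```python
import base64

def langfuse_basic_auth(public_key: str, secret_key: str) -> str:
    # Standard HTTP Basic auth: base64 of "public_key:secret_key",
    # prefixed with the "Basic " scheme.
    raw = f"{public_key}:{secret_key}".encode()
    return "Basic " + base64.b64encode(raw).decode()
```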
Zero code changes. Zero redeploy. Traces appear in Langfuse within seconds.
We briefly considered building a custom Python OTEL proxy (aiohttp middleware between nginx and the SSH forwards). Then we discovered CF AI Gateway Custom Providers and deleted all that code. The right abstraction was already there; we just hadn't looked for it.
Lesson: Before you write the middleware, check if the platform already has it. Usually it does.
Production Numbers
| Metric | Value |
| --- | --- |
| Active backends | 2 instances (always) |
| Instance TTL | 60 min (rotated at 50 min) |
| Session TTL | ~29 days |
| Health check interval | 90 seconds |
| CF tunnel replicas | 2 (auto-failover) |
| State persistence | Fly.io volume `/data` |
| Fly.io region | Singapore (`sin`) |
| VM spec | 1 shared CPU, 256MB RAM |
| Observability | CF AI Gateway + Langfuse OTEL |
The entire system runs on a single Fly.io machine with 256MB RAM. The Python orchestrator uses asyncio throughout: no threads, minimal memory footprint, and concurrent auth + pool + watchdog in one process.
Things That Went Wrong (A Highlight Reel)
- The Ghost Worker: While debugging via SSH, I accidentally launched a second worker process. Both ran simultaneously, both tried to auth the same accounts. Logs were... interesting. I killed it immediately, but not before it had already re-authed account 001 and logged a very confused "session already valid" message.
- The Silent 2FA Abort: A `JSONDecodeError` on an empty response body was caught by an outer `except Exception`, which returned "2fa_required" before the email-sending step was ever reached. No emails. No error. Just silence. It took 40 minutes to trace. Scope your exceptions.
- The Quote Escaping Labyrinth: Passing shell commands through SSH through Python f-strings through subprocess through the Fly console SSH. Single quotes inside double quotes inside f-strings inside bash inside SSH. The regex to validate the fix was longer than the fix itself.
- The "It's in the Secrets" Problem: Hardcoded credentials in a script that got committed to git. Rotation happened. Lesson learned. Now everything is `os.environ.get()` with a hard fail if missing.
- The Redundant Loop: The WebSocket connection dropped mid-conversation, and the retry logic attempted to reconnect with the same ticket from the same stale instance. New connection, old ticket, immediate rejection. Fix: treat WS errors as instance errors; destroy and recreate.
Key Design Principles (or: What I'd Tell Past Me)
- Separate concerns by TTL. Auth sessions (days), instance slots (hours), health checks (minutes), request handling (seconds): each needs its own management loop.
- Local port forwards are your friend. Dynamic external endpoints + static nginx upstream = SSH `-L` flag. Don't try to make nginx discover dynamic hosts. Make the dynamic hosts look like local ports.
- Persist everything to a volume. Container restarts are facts of life. Sessions, pool state, anything that takes time to rebuild: write it to disk on every update.
- Non-disruptive nginx reload. `nginx -s reload` is atomic. In-flight requests finish on the old config. New requests get the new upstream. Use it freely.
- Use the platform's built-in observability. Before writing a custom proxy for OTEL traces, check if your gateway/CDN already does it. Cloudflare AI Gateway Custom Providers + OTEL took 5 minutes to configure. The custom proxy took 3 hours to write and 30 seconds to delete.
- Two replicas, always. Whether it's Cloudflare Tunnel replicas or active pool slots: single points of failure in infrastructure are scheduled downtime.
The Stack Summary
| Component | Technology | Purpose |
| --- | --- | --- |
| Public ingress | Cloudflare AI Gateway | Observability, OTEL, rate limiting |
| Tunnel | Cloudflare Tunnel (2 replicas) | Stable HTTPS without exposed ports |
| Compute | Fly.io (Singapore) | Container hosting, volume storage |
| Load balancer | nginx (`least_conn`) | Internal traffic distribution |
| Orchestrator | Python asyncio | Auth + pool + watchdog management |
| Instance tunneling | Pinggy + SSH `-L` | Dynamic endpoint → static local port |
| Observability | Langfuse via OTLP | Traces, token usage, latency |
| State storage | Fly.io volume `/data` | Session + pool persistence across redeploys |
Closing Thoughts
The final architecture isn't particularly exotic. It's nginx, SSH, asyncio, and Cloudflare: tools that have existed for years. What made it interesting was the constraint: an ephemeral upstream with dynamic endpoints, short-lived sessions, and no static infrastructure to anchor to.
The Cloudflare ecosystem (Tunnel + AI Gateway) turned out to be exactly the right abstraction layer. Tunnel gives you a stable front door without a static IP. AI Gateway gives you observability without a custom proxy. Together, they handle the "how does traffic get in" and "what happened to that traffic" questions, leaving the application layer free to focus on the actual problem: keeping instances alive and sessions fresh.
If you're building AI infrastructure that needs to be resilient, observable, and not require you to wake up at 3am, the Cloudflare ecosystem is worth a serious look. It's not just a CDN anymore.
And if you're inheriting someone else's AI gateway that has hardcoded credentials, a one-liner outer `except Exception`, and a comment that says `# TODO: fix this` from 2 years ago, you now have a template for what to replace it with.

Built on Fly.io Singapore · Protected by Cloudflare Tunnel · Observed by Langfuse · Debugged at midnight
Tags:
cloudflare fly.io nginx python asyncio ai-gateway infrastructure ha opentelemetry langfuse ssh-tunneling devops