A monitoring tool that goes down during an outage is worse than useless — it's actively harmful. Your team assumes everything is fine because the dashboard is green, when in reality the dashboard itself is down.
That's why we designed DevHelm's infrastructure for reliability from day one, not as an afterthought.
The Architecture
DevHelm runs on two independent Kubernetes clusters in separate datacenters:
- NYC1 (Primary): Active cluster handling all traffic, running the API, pipeline workers, and edge probes
- SFO3 (Standby): Hot standby with streaming database replication, ready to take over in seconds
The clusters are connected by an encrypted WireGuard tunnel. All cross-DC traffic is encrypted and isolated from the public internet.
Streaming Replication
The primary TimescaleDB instance in NYC1 streams every write to the standby in SFO3 in real time via WAL-based streaming replication. This means:
- Zero data loss on failover — the standby has every check result, every alert, every metric
- Read queries can be served from the standby for geographic distribution
- Recovery time is measured in seconds, not minutes
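In PostgreSQL-based systems such as TimescaleDB, replication progress is tracked as WAL log sequence numbers (LSNs) of the form `16/B374D848`. A minimal sketch of how lag between primary and standby can be computed from two such positions (the function names are illustrative, not part of DevHelm's code):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby has not yet received/replayed."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)

# Example: a small gap between primary and standby WAL positions
lag = replication_lag_bytes("16/B374D848", "16/B374D000")  # 2120 bytes behind
```

A lag that stays near zero is what makes "zero data loss on failover" achievable in practice.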
Automated Failover
A purpose-built failover controller runs in the SFO3 cluster, continuously monitoring the health of the NYC1 cluster through multiple independent health checks.
When it detects a sustained failure (not a transient blip), the controller:
- Confirms the outage via multiple probe paths
- Promotes the SFO3 TimescaleDB to primary
- Switches DNS/traffic to the new primary
- Alerts the team (us) for post-incident review
The entire sequence completes in under 30 seconds with zero manual intervention.
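The confirm-then-promote logic above can be sketched as a small state machine. This is an illustrative sketch, not DevHelm's actual controller; the threshold of three consecutive failed rounds is an assumed value:

```python
from dataclasses import dataclass

@dataclass
class FailoverController:
    """Sketch: promote the standby only after `threshold` consecutive rounds
    in which *every* independent probe path reports the primary as down."""
    threshold: int = 3      # assumed value: consecutive failed rounds before acting
    failures: int = 0
    promoted: bool = False

    def observe(self, probe_results: list[bool]) -> None:
        # A round counts as failed only if all independent probe paths agree.
        if not any(probe_results):
            self.failures += 1
        else:
            self.failures = 0   # transient blip: reset the counter
        if self.failures >= self.threshold and not self.promoted:
            self.promote()

    def promote(self) -> None:
        # In a real controller these would be side effects:
        # promote the standby database, repoint DNS, page the team.
        self.promoted = True

ctl = FailoverController()
ctl.observe([False, True])    # one path failed, one succeeded: no action
ctl.observe([False, False])   # round 1 of sustained failure
ctl.observe([False, False])   # round 2
ctl.observe([False, False])   # round 3 -> promote
```

Requiring agreement across probe paths before counting a failure is what separates a real outage from a flaky network path.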
Edge Probes
Your uptime checks run from 5 independent probe regions across 5 continents. Each probe:
- Runs in an isolated Kubernetes namespace with strict NetworkPolicy
- Has no shared infrastructure with the application tier
- Performs independent DNS resolution to detect region-specific issues
- Requires multi-region confirmation before triggering alerts
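The multi-region confirmation rule boils down to a quorum check. A minimal sketch, assuming a quorum of two regions (the quorum size and region names here are illustrative, not documented DevHelm values):

```python
def should_alert(region_failures: dict[str, bool], quorum: int = 2) -> bool:
    """Alert only when at least `quorum` independent regions confirm the
    failure, filtering out region-local network or DNS issues."""
    confirming = sum(1 for failed in region_failures.values() if failed)
    return confirming >= quorum

# One region failing alone (e.g. a local DNS issue) does not page anyone:
should_alert({"nyc": True, "fra": False, "sgp": False, "syd": False, "gru": False})  # -> False
# Two or more regions agreeing does:
should_alert({"nyc": True, "fra": True, "sgp": False, "syd": False, "gru": False})   # -> True
```

Because each probe resolves DNS independently, a single region reporting failure is more likely a local issue than a real outage, which is why one region alone never triggers an alert.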
Why This Matters
Most monitoring SaaS products don't publish anything about their infrastructure. We do because we think it matters — if you're trusting a tool to tell you when things break, you should know how that tool is built.
Read more about our reliability commitment on our Trust & Reliability page.