A monitoring tool that goes down during an outage is worse than useless — it's actively harmful. Your team assumes everything is fine because the dashboard is green, when in reality the dashboard itself is down.
That's why we designed DevHelm's infrastructure for reliability from day one, not as an afterthought.
The Architecture
DevHelm runs on two independent Kubernetes clusters in separate datacenters:
- NYC1 (Primary): Active cluster handling all traffic, running the API, pipeline workers, and edge probes
- SFO3 (Standby): Hot standby with streaming database replication, ready to take over in seconds
The clusters are connected by an encrypted WireGuard tunnel. All cross-DC traffic is encrypted and isolated from the public internet.
Streaming Replication
The primary TimescaleDB instance in NYC1 streams every write to the standby in SFO3 in real time via WAL-based streaming replication. This means:
- Zero data loss on failover — the standby has every check result, every alert, every metric
- Read queries can be served from the standby for geographic distribution
- Recovery time is measured in seconds, not minutes
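In PostgreSQL-based systems such as TimescaleDB, replication progress is tracked as WAL log sequence numbers (LSNs) of the form `16/B374D848`. A minimal sketch of how lag between primary and standby can be computed from two such positions (the function names are illustrative, not part of DevHelm's code):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby has not yet received/replayed."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)

# Example: a small gap between primary and standby WAL positions
lag = replication_lag_bytes("16/B374D848", "16/B374D000")  # 2120 bytes behind
```

A lag that stays near zero is what makes "zero data loss on failover" achievable in practice.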
Automated Failover
A purpose-built failover controller runs in the SFO3 cluster, continuously monitoring the health of the NYC1 cluster through multiple independent health checks.
When it detects a sustained failure (not a transient blip), the controller:
- Confirms the outage via multiple probe paths
- Promotes the SFO3 TimescaleDB to primary
- Switches DNS/traffic to the new primary
- Alerts the team (us) for post-incident review
The entire sequence completes in under 30 seconds with zero manual intervention.
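The confirm-then-promote logic above can be sketched as a small state machine. This is an illustrative sketch, not DevHelm's actual controller; the threshold of three consecutive failed rounds is an assumed value:

```python
from dataclasses import dataclass

@dataclass
class FailoverController:
    """Sketch: promote the standby only after `threshold` consecutive rounds
    in which *every* independent probe path reports the primary as down."""
    threshold: int = 3      # assumed value: consecutive failed rounds before acting
    failures: int = 0
    promoted: bool = False

    def observe(self, probe_results: list[bool]) -> None:
        # A round counts as failed only if all independent probe paths agree.
        if not any(probe_results):
            self.failures += 1
        else:
            self.failures = 0   # transient blip: reset the counter
        if self.failures >= self.threshold and not self.promoted:
            self.promote()

    def promote(self) -> None:
        # In a real controller these would be side effects:
        # promote the standby database, repoint DNS, page the team.
        self.promoted = True

ctl = FailoverController()
ctl.observe([False, True])    # one path failed, one succeeded: no action
ctl.observe([False, False])   # round 1 of sustained failure
ctl.observe([False, False])   # round 2
ctl.observe([False, False])   # round 3 -> promote
```

Requiring agreement across probe paths before counting a failure is what separates a real outage from a flaky network path.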
Edge Probes
Your uptime checks run from 5 independent probe regions across 5 continents. Each probe:
- Runs in an isolated Kubernetes namespace with strict NetworkPolicy
- Has no shared infrastructure with the application tier
- Performs independent DNS resolution to detect region-specific issues
- Requires multi-region confirmation before triggering alerts
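The multi-region confirmation rule boils down to a quorum check. A minimal sketch, assuming a quorum of two regions (the quorum size and region names here are illustrative, not documented DevHelm values):

```python
def should_alert(region_failures: dict[str, bool], quorum: int = 2) -> bool:
    """Alert only when at least `quorum` independent regions confirm the
    failure, filtering out region-local network or DNS issues."""
    confirming = sum(1 for failed in region_failures.values() if failed)
    return confirming >= quorum

# One region failing alone (e.g. a local DNS issue) does not page anyone:
should_alert({"nyc": True, "fra": False, "sgp": False, "syd": False, "gru": False})  # -> False
# Two or more regions agreeing does:
should_alert({"nyc": True, "fra": True, "sgp": False, "syd": False, "gru": False})   # -> True
```

Because each probe resolves DNS independently, a single region reporting failure is more likely a local issue than a real outage, which is why one region alone never triggers an alert.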
Why This Matters
Most monitoring SaaS products don't publish anything about their infrastructure. We do because we think it matters — if you're trusting a tool to tell you when things break, you should know how that tool is built.
Read more about our reliability commitment on our Trust & Reliability page.