Disaster Recovery
This document is a work-in-progress and proposes architecture changes for the GitLab.com SaaS. The goal of these changes is to maintain GitLab.com service continuity in the case of a regional or zonal outage.
- A zonal recovery is required when all resources are unavailable in one of the three availability zones in `us-east1` or `us-central1`.
- A regional recovery is required when all resources become unavailable in one of the regions critical to the operation of GitLab.com, either `us-east1` or `us-central1`.
Services not included in the current DR strategy for FY24 and FY25
We have limited the scope of DR to services that support primary services (Web, API, Git, Pages, Sidekiq, CI, and Registry). These services tie directly into our overall availability score (internal link) for GitLab.com.
For example, DR does not include the following:
- AI services including code suggestions
- Error tracking and other observability services like tracing
- CustomersDot, responsible for billing and new subscriptions
- Advanced Search
DR Implementation Targets
The FY24 targets were:
|  | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
| --- | --- | --- |
| Zonal | 2 hours | 1 hour |
| Regional | 96 hours | 2 hours |
The FY25 targets before cell architecture are:
|  | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
| --- | --- | --- |
| Zonal | 0 minutes | 0 minutes |
| Regional | 48 hours | 0 minutes |
Note: While the RPO values are targets, they cannot be met exactly due to the limitations of regional bucket replication and replication lag of Gitaly and PostgreSQL.
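For PostgreSQL in particular, the achievable RPO is bounded by streaming replication lag. As an illustrative sketch only (not part of the runbook), lag can be measured directly on a standby with a standard PostgreSQL query; the connection parameters below are placeholders, not GitLab.com's real configuration.

```python
# Illustrative sketch: check streaming-replication lag on a PostgreSQL standby
# as a rough proxy for the achievable RPO. Connection details are placeholders.
import psycopg2

# pg_last_xact_replay_timestamp() is a standard PostgreSQL function on standbys;
# its difference from now() approximates how far the replica trails the primary.
LAG_QUERY = "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"

with psycopg2.connect(
    host="replica.example.internal",   # placeholder hostname
    dbname="gitlabhq_production",
    user="monitoring",
    password="***",
) as conn:
    with conn.cursor() as cur:
        cur.execute(LAG_QUERY)
        (lag,) = cur.fetchone()
        print(f"Current replication lag (approximate RPO): {lag}")
```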
Current Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Zonal Recovery
We have not yet simulated a full zonal outage on GitLab.com. The following are RTO/RPO estimates based on what we have been able to test using the disaster recovery runbook. It is assumed that each service can be restored in parallel. A parallel restore is the only way we are able to meet the FY24 RTO target of 2 hours for a zonal recovery.
| Service | RTO | RPO |
| --- | --- | --- |
| PostgreSQL | 1.5 hr | <=5 min |
| Redis [^1] | 0 | 0 |
| Gitaly | 30 min | <=1 hr |
| CI | 30 min | not applicable |
| Load balancing (HAProxy) | 30 min | not applicable |
| Frontend services (Web, API, Git, Pages, Registry) [^2] | 15 min | 0 |
| Monitoring (Prometheus, Thanos, Grafana, Alerting) | 0 | not applicable |
| Operations (Deployments, runbooks, operational tooling, Chef) [^3] | 30 min | 4 hr |
| PackageCloud (distribution of packages for self-managed) | 0 | 0 |
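The sketch below illustrates why parallel restores matter for the 2-hour target: when per-service restores run independently, wall-clock recovery time approaches the slowest step (PostgreSQL at ~1.5 hr) rather than the sum of all steps. The restore function and the simulated durations are placeholders based on the estimates in the table above, not the actual runbook steps.

```python
# Illustrative sketch: restore services in parallel so total recovery time
# approaches the slowest single step instead of the sum of all steps.
import time
from concurrent.futures import ThreadPoolExecutor

# Estimated restore durations (minutes) from the zonal recovery table above.
RESTORE_ESTIMATES_MIN = {
    "postgresql": 90,
    "gitaly": 30,
    "ci": 30,
    "haproxy": 30,
    "frontend": 15,
    "operations": 30,
}

def restore(service: str, duration_min: int) -> str:
    # Placeholder for the real runbook step; the duration is only simulated,
    # scaled down so the example finishes quickly.
    time.sleep(duration_min / 600)
    return f"{service} restored (est. {duration_min} min)"

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(RESTORE_ESTIMATES_MIN)) as pool:
    futures = [pool.submit(restore, svc, mins) for svc, mins in RESTORE_ESTIMATES_MIN.items()]
    for future in futures:
        print(future.result())

print(f"Serial estimate:   {sum(RESTORE_ESTIMATES_MIN.values())} min")
print(f"Parallel estimate: {max(RESTORE_ESTIMATES_MIN.values())} min "
      f"(simulated wall clock: {time.monotonic() - start:.1f} s)")
```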
Current Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Regional Recovery
Regional recovery requires a complete rebuild of GitLab.com using backups that are stored in multi-region buckets. This recovery has not yet been validated end-to-end, so we do not yet know the actual RTO for a regional failure. Our FY25 target is to have a procedure that can recover from a regional outage in under 48 hours.
The following are considerations for choosing multi-region buckets over dual-region buckets:
- We operate out of a single region so multi-region storage is only used for disaster recovery.
- Although Google recommends dual-region for disaster recovery, dual-region is not an available storage type for disk snapshots.
- To mitigate the bandwidth limitation of multi-region buckets, we spread the Gitaly VM infrastructure across multiple projects.
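As a hedged illustration of the storage choice above, a Google Cloud Storage bucket becomes multi-region by specifying a multi-region location such as `US` at creation time, which keeps backup objects available even if a single region (for example `us-east1`) is lost. The bucket and project names below are placeholders, not GitLab.com's actual configuration.

```python
# Illustrative sketch: create a GCS bucket in the "US" multi-region location
# for disaster-recovery backups. Bucket and project names are placeholders.
from google.cloud import storage

client = storage.Client(project="example-dr-project")

# A multi-region location ("US") replicates objects across regions, so backups
# remain readable during a single-region outage.
bucket = client.create_bucket("example-dr-backups", location="US")
print(f"Created bucket {bucket.name} in location {bucket.location}")
```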
Proposals for Regional and Zonal Recovery
[^1]: Most of the Redis load is on the primary node, so losing replicas should not cause any service interruption.
[^2]: We set maximum replica counts in our Kubernetes clusters servicing front-end traffic to avoid saturating downstream dependencies. For a zonal failure, a cluster reconfiguration is necessary to increase these maximums.
[^3]: There is a 4 hr RPO for Operations because Chef is a single point of failure in a single availability zone and our restore method uses disk snapshots, taken every 4 hours. While most of our Chef configuration is also stored in Git, some data (like node registrations) is only stored on the server.