Regional Recovery
Improving the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Regional Recovery
The following are the top challenges that limit our ability to drive RTO down to 48 hours for a regional recovery.
- We have a large amount of legacy infrastructure managed using Chef. This configuration has been difficult for us to manage and would require a large amount of manual copying and duplication to create new infrastructure in an alternate region.
- Operational infrastructure is located in a single region, `us-central1`. A regional failure there would require rebuilding the ops infrastructure with only local copies of runbooks and tooling scripts.
- Observability is hosted in a single region.
- The infrastructure (`dev.gitlab.org`) that builds Docker images and packages is located in a single region, and is a single point of failure.
- There is no launch-pad that would allow us to get a head-start on a regional recovery. Our IaC (Infrastructure-as-Code) does not allow us to switch regions for provisioning.
- We don't have confidence that Google can provide us with the capacity we need in a new region, specifically the large amount of SSD necessary to restore all of our customer Git data.
- We use Global DNS for internal DNS, making it difficult to use multiple instances with the same name across multiple regions. We also don't incorporate regions into DNS names for our internal endpoints (for example, dashboards and logs).
- If we deploy replicas in another region to reduce RPO, we are not yet sure of the latency or cloud-spend impacts.
- We have special, negotiated quota increases for Compute, Network, and API with Google Cloud Platform only for a single region; we would have to match these quotas in a new region and keep them in sync (see the quota-comparison sketch after this list).
- We have not standardized a way to divert traffic at the edge from one region to another.
- In monitoring and configuration, we have places where we hardcode the region to `us-east1`.
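To keep quotas in sync, we could periodically diff the regional quota limits between `us-east1` and the recovery region. The sketch below assumes the `google-cloud-compute` Python client and uses a placeholder project ID; it only illustrates the shape of such a check, not our actual tooling.

```python
# Sketch: compare regional quota limits between the primary and a candidate
# recovery region. Assumes the google-cloud-compute client library and uses
# an illustrative project ID; the real project names differ.
from google.cloud import compute_v1

PROJECT = "example-gitlab-production"  # hypothetical project ID
PRIMARY = "us-east1"
RECOVERY = "us-central1"


def region_quotas(project: str, region: str) -> dict[str, float]:
    """Return a mapping of quota metric -> limit for a region."""
    region_info = compute_v1.RegionsClient().get(project=project, region=region)
    return {str(quota.metric): quota.limit for quota in region_info.quotas}


def quota_gaps(project: str) -> dict[str, tuple[float, float]]:
    """Quotas whose limit in the recovery region is below the primary region."""
    primary = region_quotas(project, PRIMARY)
    recovery = region_quotas(project, RECOVERY)
    return {
        metric: (limit, recovery.get(metric, 0.0))
        for metric, limit in primary.items()
        if recovery.get(metric, 0.0) < limit
    }


if __name__ == "__main__":
    for metric, (primary_limit, recovery_limit) in sorted(quota_gaps(PROJECT).items()):
        print(f"{metric}: primary={primary_limit} recovery={recovery_limit}")
```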
Regional recovery work-streams
The first step of our regional recovery plan is to create new infrastructure in the recovery region, which currently involves a large number of manual steps. To give us a head-start on recovery, we propose a "regional bulkhead" deployment in a new GCP region.
A "regional bulkhead" meets the following requirements:
- A specific region is allocated.
- Quotas are set and synced so that we can duplicate all of us-east1 in the new region.
- Subnets are allocated or reserved in the same VPC used for us-east1 (see the subnet-planning sketch below).
- Some infrastructure is deployed where it makes sense to lower RTO, while keeping cloud-spend low.
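The subnet reservation can be planned up front. The following is a minimal sketch using Python's standard `ipaddress` module; the VPC range and the existing us-east1 CIDRs are made-up values, not our production address plan.

```python
# Sketch: reserve non-overlapping subnets for the recovery region inside the
# shared VPC address space. All CIDR ranges here are illustrative, not the
# ranges actually used in production.
import ipaddress

# Parent range assumed to be dedicated to this VPC (hypothetical).
VPC_RANGE = ipaddress.ip_network("10.200.0.0/16")

# Subnets already allocated to us-east1 (hypothetical).
EXISTING = [
    ipaddress.ip_network("10.200.0.0/20"),
    ipaddress.ip_network("10.200.16.0/20"),
]


def reserve_subnets(count: int, new_prefix: int = 20):
    """Pick `count` subnets of the given size that don't overlap existing ones."""
    reserved = []
    for candidate in VPC_RANGE.subnets(new_prefix=new_prefix):
        if any(candidate.overlaps(existing) for existing in EXISTING + reserved):
            continue
        reserved.append(candidate)
        if len(reserved) == count:
            break
    return reserved


if __name__ == "__main__":
    for subnet in reserve_subnets(count=4):
        print(f"candidate recovery-region subnet: {subnet}")
```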
The following are work-streams that can be done mostly in parallel.
The end-goal of the regional recovery is to have a bulkhead that has the basic scaffolding for deployment in the alternate region.
This bulkhead can be used as a launching pad for a full data restore from `us-east1` to the alternate region.
Select an alternate region
We are going with `us-central1`. Discussion for this was done in https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/25094
- Dependencies: none
- Teams: Ops
The following are considerations that need to be made when selecting an alternate region for DR (a capacity and feature-parity check is sketched after this list):
- Ensure there is enough capacity to meet compute usage.
- Network and network latency requirements, if any.
- Feature parity between regions.
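To make the capacity and feature-parity checks concrete, here is a rough sketch, assuming the `google-cloud-compute` Python client and placeholder project, zone, and machine-type names, that verifies the machine types we depend on are offered in the candidate region's zones.

```python
# Sketch: check that the machine types used in the primary region are also
# offered in the candidate recovery region. Project ID, zones, and machine
# types below are placeholders.
from google.cloud import compute_v1

PROJECT = "example-gitlab-production"  # hypothetical
CANDIDATE_ZONES = ["us-central1-b", "us-central1-c", "us-central1-f"]
REQUIRED_MACHINE_TYPES = {"n2-standard-32", "n2-highmem-64", "c2-standard-30"}


def machine_types_in_zone(project: str, zone: str) -> set[str]:
    client = compute_v1.MachineTypesClient()
    return {mt.name for mt in client.list(project=project, zone=zone)}


def missing_machine_types(project: str) -> dict[str, set[str]]:
    """Map each candidate zone to the required machine types it lacks."""
    gaps = {}
    for zone in CANDIDATE_ZONES:
        available = machine_types_in_zone(project, zone)
        missing = REQUIRED_MACHINE_TYPES - available
        if missing:
            gaps[zone] = missing
    return gaps


if __name__ == "__main__":
    gaps = missing_machine_types(PROJECT)
    if not gaps:
        print("all required machine types are available in the candidate zones")
    for zone, missing in gaps.items():
        print(f"{zone} is missing: {', '.join(sorted(missing))}")
```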
Deploy Kubernetes clusters supporting front-end services in a new region with deployments
- Dependencies: External front-end load balancing
- Teams: Ops, Foundations, Delivery
GitLab.com has Web, API, Git, Git HTTPS, Git SSH, Pages, and Registry as front-end services.
All of these services run in 4 Kubernetes clusters deployed in `us-east1`.
These services are either stateless or use multi-region storage buckets for data.
In the case of a failure in `us-east1`, we would need to rebuild these clusters in the alternate region and set them up for deployments.
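The rebuild itself would be driven by our IaC, but as a minimal sketch of what recreating a front-end cluster in the alternate region amounts to, the following uses the `google-cloud-container` Python client with placeholder project, region, cluster name, and sizing; it is illustrative only, not our provisioning code.

```python
# Sketch: create a regional GKE cluster in the recovery region. In reality this
# is expressed in our IaC; the project, cluster name, and sizing here are
# placeholders, not the production values.
from google.cloud import container_v1

PROJECT = "example-gitlab-gke"  # hypothetical project ID
RECOVERY_REGION = "us-central1"


def create_frontend_cluster(name: str = "gprd-gitlab-gke-recovery") -> None:
    client = container_v1.ClusterManagerClient()
    parent = f"projects/{PROJECT}/locations/{RECOVERY_REGION}"
    cluster = {"name": name, "initial_node_count": 3}  # placeholder sizing
    operation = client.create_cluster(request={"parent": parent, "cluster": cluster})
    print(f"cluster create operation started: {operation.name}")


if __name__ == "__main__":
    create_frontend_cluster()
```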
Switch from Global to Zonal DNS
- Dependencies: None
- Teams: Gitaly
Gitaly VMs are single points of failure that are deployed in `us-east1`.
The internal DNS naming of the nodes follows this convention:
`gitaly-01-stor-gprd.c.gitlab-gitaly-gprd-ccb0.internal`
where `gitaly-01-stor-gprd` is the node name and `gitlab-gitaly-gprd-ccb0` is the project.
By switching to zonal DNS, we can change the internal DNS entries so they have the zone in the DNS name:
`gitaly-01-stor-gprd.c.us-east1-b.gitlab-gitaly-gprd-ccb0.internal`
where `us-east1-b` is the zone.
This allows us to keep the same name when recovering into a new region or zone:
- `gitaly-01-stor-gprd.c.us-east1-b.gitlab-gitaly-gprd-ccb0.internal`
- `gitaly-01-stor-gprd.c.us-east4-a.gitlab-gitaly-gprd-ccb0.internal`
For fleets of VMs outside of Kubernetes, these names allow us to have the same node names in the recovery region.
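As a small illustration of the convention, the sketch below builds the zonal internal DNS name for a Gitaly node from its components; only the zone changes between the primary and recovery deployments, while the node name and project stay the same.

```python
# Sketch: build zonal internal DNS names for Gitaly nodes. The node name and
# project stay identical across regions; only the zone component changes,
# following the convention shown above.
def zonal_dns_name(node: str, zone: str, project: str) -> str:
    return f"{node}.c.{zone}.{project}.internal"


PRIMARY_ZONE = "us-east1-b"
RECOVERY_ZONE = "us-east4-a"  # example recovery zone from the list above
PROJECT = "gitlab-gitaly-gprd-ccb0"

for zone in (PRIMARY_ZONE, RECOVERY_ZONE):
    print(zonal_dns_name("gitaly-01-stor-gprd", zone, PROJECT))
```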
Gitaly
- Dependencies: Switch from Global to Zonal DNS (optional, but desired)
- Teams: Gitaly, Ops, Foundations
Restoring the entire Gitaly fleet requires a large number of VMs deployed in the alternate region. It also requires a lot of bandwidth because restore is based on disk snapshots. To ensure a successful Gitaly restore, quotas need to be synced with us-east1 and there needs to be end-to-end validation.
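A back-of-the-envelope estimate helps judge whether a snapshot-based restore fits within the RTO. The numbers in the sketch below (fleet size, disk size, throughput, parallelism) are assumptions for illustration only, not measured values.

```python
# Sketch: back-of-the-envelope estimate of Gitaly restore time from disk
# snapshots. Fleet size, disk size, parallelism, and throughput are assumed
# values for illustration only.
FLEET_SIZE = 100                     # number of Gitaly VMs (assumed)
DISK_SIZE_GB = 16_000                # data disk size per VM (assumed)
RESTORE_THROUGHPUT_MB_PER_SEC = 250  # effective restore throughput per disk (assumed)
PARALLEL_RESTORES = 50               # how many disks we restore at once (assumed)


def restore_hours() -> float:
    per_disk_hours = (DISK_SIZE_GB * 1024) / RESTORE_THROUGHPUT_MB_PER_SEC / 3600
    waves = -(-FLEET_SIZE // PARALLEL_RESTORES)  # ceiling division
    return per_disk_hours * waves


if __name__ == "__main__":
    print(f"estimated Gitaly restore time: {restore_hours():.1f} hours")
```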
PostgreSQL
- Dependencies: Improve Chef provisioning time by using preconfigured golden OS images (optional, but desired); local backups in the standby region (data disk snapshots and WAL archiving)
- Teams: Database Reliability, Ops
The Patroni provisioning configuration only allows a single region per cluster. Networking infrastructure, Consul, and load balancers need to be set up in the alternate region. We may consider setting up a "cascaded cluster" for the databases to improve recovery time for replication.
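If we do stand up a cascaded cluster in the alternate region, the replication lag of the cross-region standby effectively bounds the RPO. The sketch below assumes `psycopg2` and a hypothetical standby hostname, and queries standard PostgreSQL replay functions on the standby.

```python
# Sketch: measure replication delay on a cross-region standby to estimate RPO.
# Assumes psycopg2 and a hypothetical standby host and credentials.
import psycopg2

STANDBY_DSN = "host=patroni-standby.recovery.example dbname=gitlabhq_production user=monitoring"


def replay_lag_seconds() -> float:
    """Seconds since the last transaction replayed on the standby.

    Note: on an idle primary this value grows even when the standby is caught up.
    """
    with psycopg2.connect(STANDBY_DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        (lag,) = cur.fetchone()
        return float(lag or 0.0)


if __name__ == "__main__":
    print(f"standby replay lag: {replay_lag_seconds():.1f}s")
```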
Redis
- Dependencies: Improve Chef provisioning time by using preconfigured golden OS images (optional, but desired)
- Teams: Ops
To provision Redis, subnets need to be allocated in the alternate region, followed by end-to-end validation of the new deployments.
External front-end load balancing
- Dependencies: HAProxy replacement, most likely GKE Gateway and Istio
- Teams: Ops, Foundations
External front-end load balancing is necessary to validate the deployment in the alternate region. This requires both external and internal LBs for all front-end services.
Monitoring
- Dependencies: Eliminate X% Chef dependencies in Infra by moving infra away from Chef (migrate Prometheus infra to Kubernetes)
- Teams: Scalability:Observability, Ops, Foundations
Set up an alternate ops Kubernetes cluster in a different region that is scaled down to zero replicas.
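As an illustration of operating a scaled-to-zero standby, the sketch below uses the Kubernetes Python client with hypothetical kubeconfig context and namespace names to set every Deployment in the standby ops cluster's monitoring namespace to a target replica count: zero while idle, and back up during a recovery.

```python
# Sketch: scale all Deployments in the standby ops cluster's monitoring
# namespace to a target replica count (0 while idle, >0 during recovery).
# The kubeconfig context and namespace names are hypothetical.
from kubernetes import client, config


def scale_namespace(context: str, namespace: str, replicas: int) -> None:
    config.load_kube_config(context=context)
    apps = client.AppsV1Api()
    for deployment in apps.list_namespaced_deployment(namespace).items:
        apps.patch_namespaced_deployment_scale(
            name=deployment.metadata.name,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )
        print(f"scaled {deployment.metadata.name} to {replicas}")


if __name__ == "__main__":
    # Keep the standby cluster idle; flip replicas to 1+ during a regional recovery.
    scale_namespace(context="ops-recovery-cluster", namespace="monitoring", replicas=0)
```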
Runners
- Dependencies: Improve Chef provisioning time by using preconfigured golden OS images (optional, but desired)
- Teams: Scalability:Practices, Ops, Foundations
Ensure quotas are set in the alternate region and aligned with us-east1 for both runner managers and ephemeral VMs. Set up and validate the networking configuration, including VPC peering.
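For the peering validation, a simple pre-Gameday check along the lines of the sketch below could confirm that the expected VPC peerings exist and are active; it assumes the `google-cloud-compute` Python client and placeholder project, network, and peering names.

```python
# Sketch: verify that VPC peerings on the runner network exist and are ACTIVE.
# Project, network, and peering names are placeholders.
from google.cloud import compute_v1

PROJECT = "example-gitlab-ci"  # hypothetical project ID
NETWORK = "ci-runners"         # hypothetical VPC name
EXPECTED_PEERINGS = {"gprd-peering", "ops-peering"}  # hypothetical peering names


def check_peerings(project: str, network: str) -> list[str]:
    """Return a list of problems found with the expected peerings."""
    net = compute_v1.NetworksClient().get(project=project, network=network)
    by_name = {peering.name: peering for peering in net.peerings}
    problems = []
    for name in EXPECTED_PEERINGS:
        peering = by_name.get(name)
        if peering is None:
            problems.append(f"peering {name} is missing")
        elif peering.state != "ACTIVE":
            problems.append(f"peering {name} is {peering.state}")
    return problems


if __name__ == "__main__":
    for problem in check_peerings(PROJECT, NETWORK):
        print(problem)
```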
Ops and Packaging
- Dependencies: Create an HA Chef server configuration to avoid an outage for a single zone failure
- Teams: Scalability:Practices, Ops, Foundations, Distribution
All image creation and packaging is done on a single VM, and our operational tooling is also on a single VM. Both of these are single points of failure that store data locally. In the case of a regional outage, we would need to rebuild them from snapshots and would lose about 4 hours of data.
The following are options to mitigate this risk:
- Move our packaging jobs to `ops.gitlab.net` so we eliminate `dev.gitlab.org` as a single point of failure.
- Use the Geo feature for `ops.gitlab.net`.
Regional Recovery Gameday
- Dependencies: Recovery improvements
- Teams: Ops
Following the improvements for regional recovery, a Gameday needs to be executed for end-to-end testing of the procedure. Once validated, it can be added to our existing disaster recovery runbook.