
Disclaimer: Mobb.ninja is not official Red Hat documentation - These guides may be experimental, proof of concept or early adoption. Officially supported documentation is available at https://docs.openshift.com.

ARO - Considerations for Disaster Recovery

This is a high level overview of disaster recovery options for Azure Red Hat OpenShift. It is not a detailed design, but rather a starting point for a more detailed design.

What is Disaster Recovery (DR)

Disaster Recovery is an umbrella term that includes the following:

  1. Backup (and restore!)
  2. Failover (and failback!)
  3. High Availability
  4. Disaster Avoidance

The most important part of Disaster Recovery is the “Recovery”. Whatever your DR plan is, it must be tested, and ideally exercised on a semi-regular basis.

You can use RTO (Recovery Time Objective) and RPO (Recovery Point Objective) to help determine what level of DR is right for your company. These Objectives are often application dependent and may mean choosing full HA for one application, and Backup/Restore for another even if they’re both on the same OpenShift cluster.

Recovery Time Objective (RTO)

How long can your application be down without causing significant issues for your business? This can differ from application to application. Some applications may only affect internal staff while others may affect customers and revenue.

In general you will categorize your applications by priority and potential damage to the business and match your DR plans accordingly.

Recovery Point Objective (RPO)

How much data can you lose before significant damage is done to your business? The traditional backup strategy is daily. If you can survive a loss of 24 hours of data, or you have an alternative way to restore that data, then this is often good enough.

Combined RTO / RPO

When combined you will account for “how long can the application be offline” and “how much data can I lose”. If the answer is zero or approaching zero for both, then your DR strategy must be focused on High Availability and real-time data replication.
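As a rough illustration of how the two objectives combine, the sketch below maps an RTO and RPO (both in minutes) to one of the approaches described later in this guide. The function name and the thresholds (5 minutes, 4 hours, 24 hours) are assumptions chosen for the example, not recommendations; substitute the values your business agrees on.

```shell
# Hypothetical helper: suggest a DR approach from RTO and RPO, both in minutes.
# All thresholds are illustrative assumptions.
suggest_dr_strategy() {
  local rto_minutes="$1" rpo_minutes="$2"

  if [ "$rto_minutes" -le 5 ] && [ "$rpo_minutes" -le 5 ]; then
    # Near-zero downtime and data loss: HA with real-time replication.
    echo "hot/hot"
  elif [ "$rto_minutes" -le 240 ] || [ "$rpo_minutes" -lt 1440 ]; then
    # Short outage tolerated, or daily backups would lose too much data:
    # keep a standing DR cluster with replicated data.
    echo "hot/warm"
  else
    # A long outage and up to a day of data loss are acceptable:
    # backup, restore and manual cutover.
    echo "hot/cold"
  fi
}
```

For example, `suggest_dr_strategy 60 1440` prints `hot/warm`, while `suggest_dr_strategy 2880 1440` prints `hot/cold`.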

Backup

In OpenShift it is not necessary to back up the cluster itself, but instead you back up the “active state” of your resources, any Persistent Volumes, and any backing services.

Azure provides documentation on the basic Backup and Restore of the applications running on your ARO cluster.
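A minimal sketch of what an application backup and restore could look like, assuming Velero is the backup tool in use and is already installed and configured against a storage location. The function names and the namespace and backup names are hypothetical; the cluster itself is not backed up, only the resources and volumes in the chosen namespace.

```shell
# Assumption: the velero CLI is installed and pointed at the cluster.
# Back up every resource (and volume snapshot) in one namespace.
backup_namespace() {
  local namespace="$1" backup_name="$2"
  velero backup create "$backup_name" --include-namespaces "$namespace" --wait
}

# Restore a previously created backup, for example into a new cluster.
restore_backup() {
  local backup_name="$1"
  velero restore create --from-backup "$backup_name" --wait
}

# Usage (placeholder names):
#   backup_namespace my-app "my-app-$(date +%Y%m%d)"
#   restore_backup my-app-20240101
```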

Azure also provides documentation on Backing up the various PaaS backing services that you may have connected to your applications such as Azure PostgreSQL.

Failover

An ARO cluster can be deployed into Multiple Availability Zones (AZs) in a single region. To protect your applications from region failure you must deploy your application into multiple ARO clusters across different regions. Here are some considerations:

Backup and Restore - Manual Cutover ( Hot / Cold )

Do you currently have the ability to do a point in time restore of Backups of your applications?

  1. Create a backup of your Kubernetes cluster

    If you restore these backups to a new cluster and manually cut over the DNS, will your applications be fully functional?

  2. Create backups of any regionally co-located resources (like Redis, Postgres, etc.).

    Some Azure PaaS services such as Azure Container Registry can replicate to another region which may assist in performing backups or restore. This replication is often one way, therefore a new replication relationship must be created from the new region to another for the next DR event.

  3. If using DNS based failover, make sure TTLs are set to a suitable value.

  4. Determine if Non-regionally co-located resources (such as SaaS products) have appropriate failover plans and ensure that any special networking arrangements are available at the DR region.
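For the DNS TTL check in step 3, a small sketch like the following can help. It assumes `dig` is available and treats 300 seconds as the default upper limit, which is an arbitrary value for illustration. Both function names are hypothetical.

```shell
# Hypothetical helper: print the TTL (in seconds) of the first answer record.
# Reads the output of `dig +noall +answer <name>` on stdin, where the TTL is
# the second column.
answer_ttl() {
  awk 'NF >= 5 { print $2; exit }'
}

# Succeeds when the TTL is at or below the limit (default 300s, an assumption).
ttl_is_suitable() {
  local ttl="$1" max_ttl="${2:-300}"
  [ "$ttl" -le "$max_ttl" ]
}

# Usage (placeholder hostname):
#   ttl=$(dig +noall +answer app.example.com | answer_ttl)
#   ttl_is_suitable "$ttl" || echo "TTL of ${ttl}s is too high for DNS failover"
```

Clients may keep using the old address until the TTL expires, so the TTL adds directly to your effective RTO.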

Failover to an existing cluster in the DR region (Hot / Warm)

In a Hot / Warm situation the destination cluster should be similar to the source cluster, but for financial reasons may be smaller, or be single AZ. If this is the case you may either run the DR cluster with lower expectations on performance and resilience, with the idea of failing back to the original cluster ASAP, or expand the DR cluster to match the original cluster and turn the original cluster into the next DR site.

Ideally your applications and data should be replicated to the DR site and should be ready to switch over within a very short window.

High Availability ( Hot / Hot )

In a Hot / Hot scenario you have two-way synchronous replication of your data. The end user can access the application in either site and have the exact same experience.

ARO Specific Example

The following example creates two ARO clusters, each in its own region. Virtual Network Peering is used to make it easier for resources to communicate for replication.

Create a Primary Cluster

  1. Set the following environment variables:

     AZR_RESOURCE_LOCATION=eastus
     AZR_RESOURCE_GROUP=ARO-DR-1
     AZR_CLUSTER=ARODR1
     AZR_PULL_SECRET=~/Downloads/pull-secret.txt
     NETWORK_SUBNET=10.0.0.0/20
     CONTROL_SUBNET=10.0.0.0/24
     MACHINE_SUBNET=10.0.1.0/24
     FIREWALL_SUBNET=10.0.2.0/24
     JUMPHOST_SUBNET=10.0.3.0/24
    
    
  2. Complete the remaining steps to create the networks and cluster by following the Private ARO cluster guide.

Create a Secondary Cluster

  1. Set the following environment variables:

     AZR_RESOURCE_LOCATION=centralus
     AZR_RESOURCE_GROUP=ARO-DR-2
     AZR_CLUSTER=ARODR2
     AZR_PULL_SECRET=~/Downloads/pull-secret.txt
     NETWORK_SUBNET=10.1.0.0/20
     CONTROL_SUBNET=10.1.0.0/24
     MACHINE_SUBNET=10.1.1.0/24
     FIREWALL_SUBNET=10.1.2.0/24
     JUMPHOST_SUBNET=10.1.3.0/24
    
  2. Complete the remaining steps to create the networks and cluster by following the Private ARO cluster guide.
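Peered virtual networks must not have overlapping address ranges, which is why the secondary cluster uses 10.1.0.0/20 rather than reusing 10.0.0.0/20. If you choose your own ranges, a check along these lines can catch a clash before you create anything. The helper names are hypothetical and the sketch handles IPv4 only.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  # Split the address on dots into $1..$4.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed (exit 0) when two CIDR ranges overlap.
cidrs_overlap() {
  local ip1="${1%/*}" len1="${1#*/}" ip2="${2%/*}" len2="${2#*/}" len mask
  # Two ranges overlap exactly when they agree on the shorter prefix.
  len=$(( len1 < len2 ? len1 : len2 ))
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip1") & mask )) -eq $(( $(ip_to_int "$ip2") & mask )) ]
}

# Usage:
#   cidrs_overlap 10.0.0.0/20 10.1.0.0/20 && echo "ranges overlap, pick another"
```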

Connect the clusters via Virtual Network Peering

Virtual network peering connects two Azure virtual networks, which may be in different regions, so that resources in each can reach the other over private addresses. Ideally you would use a Hub-Spoke topology and create appropriate firewalling in the Hub network, but that is an exercise left for the reader; here we create a simple open peering between the two networks.

  1. Get the ID of the two networks you created in the previous step.

     DR1_VNET=$(az network vnet show \
       --resource-group ARO-DR-1 \
       --name ARODR1-aro-vnet-eastus \
       --query id --out tsv)
     echo $DR1_VNET
    
     DR2_VNET=$(az network vnet show \
       --resource-group ARO-DR-2 \
       --name ARODR2-aro-vnet-centralus \
       --query id --out tsv)
     echo $DR2_VNET
    
  2. Create peering from the Primary network to the Secondary network.

     az network vnet peering create \
       --name primary-to-secondary \
       --resource-group ARO-DR-1 \
       --vnet-name ARODR1-aro-vnet-eastus \
       --remote-vnet $DR2_VNET \
       --allow-vnet-access
    
  3. Create peering from the Secondary network to the Primary network.

     az network vnet peering create \
       --name secondary-to-primary \
       --resource-group ARO-DR-2 \
       --vnet-name ARODR2-aro-vnet-centralus \
       --remote-vnet $DR1_VNET \
       --allow-vnet-access
    
  4. Verify that the Jump Host in the Primary region is able to reach the Jump Host in the Secondary region.

     ssh -i $HOME/.ssh/id_rsa aro@$JUMP_IP ping 10.1.3.4
    
     PING 10.1.3.4 (10.1.3.4) 56(84) bytes of data.
     64 bytes from 10.1.3.4: icmp_seq=1 ttl=64 time=23.8 ms
     64 bytes from 10.1.3.4: icmp_seq=2 ttl=64 time=23.10 ms
    
  5. SSH to the jump host, forwarding port 1337 as a SOCKS proxy.

     ssh -D 1337 -C -i $HOME/.ssh/id_rsa aro@$JUMP_IP
    
  6. Configure localhost:1337 as a SOCKS proxy in your browser and access the two consoles.
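If the ping in step 4 fails, it can help to confirm that both sides of the peering report a state of `Connected`; a peering stays in `Initiated` until its counterpart exists. The sketch below wraps `az network vnet peering show` in a hypothetical helper, using the resource names from the steps above.

```shell
# Hypothetical helper: succeed only when a peering reports "Connected".
peering_is_connected() {
  local resource_group="$1" vnet_name="$2" peering_name="$3" state
  state=$(az network vnet peering show \
    --resource-group "$resource_group" \
    --vnet-name "$vnet_name" \
    --name "$peering_name" \
    --query peeringState --out tsv)
  echo "$peering_name: $state"
  [ "$state" = "Connected" ]
}

# Usage:
#   peering_is_connected ARO-DR-1 ARODR1-aro-vnet-eastus primary-to-secondary
#   peering_is_connected ARO-DR-2 ARODR2-aro-vnet-centralus secondary-to-primary
```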

From here the two clusters are visible to each other via their frontends. This means they can access each other’s ingress endpoints, routes and Load Balancers, but cannot communicate pod-to-pod. A PostgreSQL pod in the primary cluster could replicate to a PostgreSQL pod in the secondary cluster via a service of type LoadBalancer.
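As a sketch of the LoadBalancer approach, the function below prints a Service manifest that requests an internal Azure load balancer, so the address is only reachable from the peered networks. The service name, namespace and `app: postgres` selector are placeholders invented for this example; match them to your own deployment.

```shell
# Hypothetical manifest: expose a PostgreSQL pod on an internal load balancer.
# Name, namespace and selector are placeholders.
postgres_lb_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: postgres-replication
  namespace: my-database
  annotations:
    # Request a private (VNet-internal) load balancer address from Azure.
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: postgres
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
EOF
}

# Usage, logged in to the cluster that hosts the pod:
#   postgres_lb_manifest | oc apply -f -
#   oc get service postgres-replication -n my-database
```

The address shown under EXTERNAL-IP is what the PostgreSQL instance in the other cluster would use as its replication target.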

Cross Region Registry Replication

OpenShift comes with a local registry that is used for local builds etc., but it is likely that you use a centralized registry for your own applications and images. Ensure that your registry supports replication to the DR region, and that you understand whether it supports active/active replication or only one-way replication.

In a Hot / Warm scenario where you’ll only ever use the DR region as a backup to the primary region, it’s likely okay for one-way replication to be used.

Example - Create an ACR in the Primary Region

  1. Create a new ACR in the primary region.

     az acr create --resource-group ARO-DR-1 \
       --name acrdr1 --sku Premium
    
  2. Log in to the ACR and push an image to it.

     az acr login --name acrdr1
     docker pull mcr.microsoft.com/hello-world
     docker tag mcr.microsoft.com/hello-world acrdr1.azurecr.io/hello-world:v1
     docker push acrdr1.azurecr.io/hello-world:v1
    
  3. Replicate the registry to the DR2 region.

     az acr replication create --location centralus --registry acrdr1
    
  4. Wait a few moments and then check the replication status.

     az acr replication show --name centralus --registry acrdr1 --query status
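Rather than waiting an unspecified time, you could poll until the replication has finished provisioning. This is a minimal sketch: the function name, the default of 30 attempts and the 10 second delay are assumptions, and it reads the `provisioningState` field rather than the `status` object queried above.

```shell
# Hypothetical helper: poll until an ACR replication finishes provisioning.
# Arguments: registry name, replication name, max attempts, delay in seconds.
wait_for_acr_replication() {
  local registry="$1" replication="$2" attempts="${3:-30}" delay="${4:-10}" state
  while [ "$attempts" -gt 0 ]; do
    state=$(az acr replication show \
      --registry "$registry" --name "$replication" \
      --query provisioningState --out tsv)
    case "$state" in
      Succeeded)
        echo "replication $replication is ready"
        return 0 ;;
      Failed|Canceled)
        echo "replication $replication ended in state $state" >&2
        return 1 ;;
    esac
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "timed out waiting for replication $replication" >&2
  return 1
}

# Usage:
#   wait_for_acr_replication acrdr1 centralus
```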
    

Red Hat Advanced Cluster Management

Advanced Cluster Management (ACM) is a set of tools that can be used to manage the lifecycle of multiple OpenShift clusters. ACM gives you a single view into your clusters, provides GitOps-style management of your workloads, and also has compliance features.

You can run ACM from a central infrastructure (or your Primary DR) cluster and connect your ARO clusters to it.

Failover for Application Ingress

If you want to expose your applications to the internet, you can use Azure Front Door or Azure Traffic Manager to fail the routing over to the DR site.

However if you are running private clusters your choices are a bit more limited.

Example using simple DNS:

  1. Create a new wildcard DNS record with a low TTL pointing to the Primary Cluster’s Ingress/Route ExternalIP in your private DNS zone. (in our case it was *.aro-dr.mobb.ninja)

  2. Modify the route for both apache examples to use the new wildcard DNS record.

  3. Test access

  4. Update the DNS record to point to the DR site’s Ingress/Route ExternalIP.

  5. Test access
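The record update in step 4 could be scripted along these lines, assuming the wildcard record lives in an Azure Private DNS zone. The function name is hypothetical and the resource group, zone and addresses in the usage comment are placeholders. The new address is added before the old one is removed so the record set is never empty; for a short time lookups may return both addresses.

```shell
# Hypothetical helper: repoint an A record in an Azure Private DNS zone.
# Arguments: resource group, zone name, record name, old IP, new IP.
repoint_dns_record() {
  local resource_group="$1" zone="$2" record="$3" old_ip="$4" new_ip="$5"
  # Add the DR address first, then drop the primary address.
  az network private-dns record-set a add-record \
    --resource-group "$resource_group" --zone-name "$zone" \
    --record-set-name "$record" --ipv4-address "$new_ip" &&
  az network private-dns record-set a remove-record \
    --resource-group "$resource_group" --zone-name "$zone" \
    --record-set-name "$record" --ipv4-address "$old_ip"
}

# Usage (placeholder values):
#   repoint_dns_record my-dns-rg aro-dr.mobb.ninja '*' 10.0.1.10 10.1.1.10
```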