If you are considering Azure Files as the persistent storage for your Azure Kubernetes Service (AKS) applications, there are important considerations around AKS backup and recovery that affect how you can run Dev, Test, and Staging environments. This article outlines these data management considerations in detail and shows how to work around an Azure Files limitation to achieve feature parity with Azure Managed Disks.
Before we walk you through how to automate Azure Files restores from AKS snapshots, let’s quickly touch on some of the important underlying concepts:
A Persistent Volume (PV) is a storage resource created and managed by the Kubernetes API that can exist beyond the lifetime of an individual pod. It is often used in conjunction with the aptly named StatefulSets. In AKS, you can use Azure Disks or Azure Files to back a Persistent Volume. The choice between Disks and Files is often determined by the need for concurrent access to the data from multiple nodes or pods (lean towards Files) or the need for a higher performance tier (lean towards Disks).
A Persistent Volume can be statically created by a cluster administrator, or dynamically created by the Kubernetes API server. Dynamic provisioning uses a StorageClass to identify what type of Azure storage needs to be created.
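As an illustration, a PersistentVolumeClaim like the following (the claim name and size are hypothetical) is enough to trigger dynamic provisioning; the referenced azurefile-csi storage class tells AKS to create an Azure File Share behind the scenes:

```shell
# Hypothetical PVC that dynamically provisions an Azure File Share
# through the built-in azurefile-csi storage class.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteMany          # Azure Files allows concurrent mounts across nodes
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 100Gi
EOF
```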
The Container Storage Interface (CSI) is a standard for exposing arbitrary block and file storage systems to containerized workloads on Kubernetes. By adopting CSI, AKS can write, deploy, and iterate on plug-ins that expose new storage systems or improve existing ones, without having to touch the core Kubernetes code or wait for its release cycles.
Starting with Kubernetes version 1.21, AKS uses CSI drivers only, and by default. Existing in-tree volumes still work, but AKS internally routes all operations to a CSI driver. The default storage class is managed-csi. The built-in CSI storage classes, as described in the AKS documentation, are:
managed-csi: Uses Azure Standard SSD locally redundant storage (LRS) to create a Managed Disk. The reclaim policy ensures that the underlying Azure Disk is deleted when the persistent volume that used it is deleted. The storage class also configures the persistent volumes to be expandable; you just need to edit the persistent volume claim with the new size.
managed-csi-premium: Uses Azure Premium locally redundant storage (LRS) to create a Managed Disk. The reclaim policy again ensures that the underlying Azure Disk is deleted when the persistent volume that used it is deleted. Similarly, this storage class allows for persistent volumes to be expanded.
azurefile-csi: Uses Azure Standard storage to create an Azure File Share. The reclaim policy ensures that the underlying Azure File Share is deleted when the persistent volume that used it is deleted.
azurefile-csi-premium: Uses Azure Premium storage to create an Azure File Share. The reclaim policy ensures that the underlying Azure File Share is deleted when the persistent volume that used it is deleted.
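You can see these classes on your own cluster with kubectl; the exact set and defaults vary with the AKS version:

```shell
# List the built-in storage classes on an AKS cluster.
kubectl get storageclass

# Inspect the default class in detail (provisioner, reclaim policy,
# allowVolumeExpansion, etc.).
kubectl describe storageclass managed-csi
```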
Azure CSI Snapshots
Apart from provisioning PVs, a key function of CSI drivers is enabling snapshots and recovery. Both Azure Managed Disks and Azure Files support snapshots of Persistent Volumes. However, not all CSI drivers are created equal. While they all support provisioning, several cloud and storage vendors do not support CSI for some of the more advanced functionality, such as resizing, recovery, and snapshots. Microsoft itself supports CSI drivers more extensively for Azure Managed Disks than for Azure Files.
Azure Files Snapshots
The Azure Files CSI driver supports creating snapshots of persistent volumes and the underlying file shares.
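For example, a snapshot of an existing PVC can be taken by defining a VolumeSnapshotClass that points at the Azure Files CSI driver (file.csi.azure.com) and a VolumeSnapshot that references the claim. The class and claim names below are illustrative:

```shell
# Hypothetical example: snapshot an existing Azure Files PVC.
cat <<'EOF' | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-azurefile-vsc
driver: file.csi.azure.com        # Azure Files CSI driver
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: azurefile-volume-snapshot
spec:
  volumeSnapshotClassName: csi-azurefile-vsc
  source:
    persistentVolumeClaimName: app-data   # name of the PVC to snapshot
EOF
```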
You can describe the snapshot by running the following command:
kubectl describe volumesnapshot azurefile-volume-snapshot
The output of the command resembles the following example screenshot, where you can see the Ready To Use flag set to true at the bottom of the response:
If the kubectl describe volumesnapshot command fails for any reason, the Ready To Use flag is set to false at the bottom of the response:
The issue we ran into with the Azure Files CSI driver is that it does not support automated mounting of, or recovery from, these AKS file snapshots. Restoring from a snapshot is possible, but only if the restore is done manually from the Azure portal or CLI.
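As a sketch of what the manual path looks like, the contents of a share snapshot can be copied back into a share with AzCopy using the sharesnapshot query parameter on the source URL. The account, share names, SAS tokens, and snapshot timestamp below are placeholders:

```shell
# Hypothetical manual restore: copy a share snapshot's contents into a
# (new) share with AzCopy. Replace the <...> placeholders and the
# snapshot timestamp with real values from your storage account.
azcopy copy \
  "https://<account>.file.core.windows.net/<share>?sharesnapshot=2022-06-01T10:30:00.0000000Z&<SAS>" \
  "https://<account>.file.core.windows.net/<newshare>?<SAS>" \
  --recursive
```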
The following output shows the restore failure when it is attempted through the CSI driver:
The Azure support team graciously acknowledged this limitation, and their guidance suggested it is a gap that may not be fixed anytime soon. However, for AKS data protection applications like CloudCasa that use CSI snapshots, a manual workaround is too much of a headache in this automated world. So we got to work automating restores from Azure Files snapshots.
Automating the Use of Azure Files Snapshots
CloudCasa launched native integration with AKS at KubeCon + CloudNativeCon Europe 2022. This integration means that CloudCasa can inventory Azure accounts, query a list of all AKS clusters and other dependent resources, and catalog how each cluster is configured. This capability is necessary for two reasons:
- Compliance and centralized management: You can see and manage all AKS clusters across all users’ AKS accounts (test, dev, staging, different teams, etc).
- Advanced Recovery: By backing up the configuration of each AKS cluster, CloudCasa can recreate them on the fly during data migration or recovery. Users can also pick a different size of cluster and different auto-scaling parameters on these restored instances.
This level of integration significantly reduced the work needed to automate Azure Files restores. We just needed to query the snapshotID of each Azure Files CSI snapshot and find the corresponding snapshot by directly querying the Azure Files API.
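For reference, the Azure-side snapshot ID is recorded by the CSI snapshot machinery as the snapshotHandle on the VolumeSnapshotContent object bound to each VolumeSnapshot, so a lookup along these lines (using the snapshot name from the earlier example) surfaces it:

```shell
# Find the Azure resource ID (snapshot handle) behind a CSI snapshot.
CONTENT=$(kubectl get volumesnapshot azurefile-volume-snapshot \
  -o jsonpath='{.status.boundVolumeSnapshotContentName}')
kubectl get volumesnapshotcontent "$CONTENT" \
  -o jsonpath='{.status.snapshotHandle}'
```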
Once the snapshot is matched, we can both copy its contents to another region and, more importantly, recover it to a new volume as needed. When performing copies, the CloudCasa data mover verifies whether a block has already been transferred to the cloud repository and backs up only the blocks that have changed since the last transfer.
During recovery from snapshots, CloudCasa automatically identifies the snapshot that contains the PV content a user is trying to recover. It then dynamically provisions a new PV through the Azure Files CSI driver and copies the snapshot’s contents into it using the Azure Files copy operation. This runs as a pre-restore task, before restoring resources that may depend on the PV. Both the automated copy and recovery operations can also be orchestrated through CloudCasa’s APIs.
For the visually inclined, here are a few screenshots of the restore definition and the result of the restore job, along with the job monitor stats.
Defining a backup of Azure Files data:
After a successful Azure Files snapshot:
Restoring an Azure Files snapshot to a new Namespace:
Swiftly Delivering the Fix with SaaS
As you can see, not all CSI drivers are created equal, and some providers even treat automated recovery as a luxury. We appreciate the CloudCasa user who brought this Azure Files snapshot recovery pain point to our attention. It was satisfying to leverage CloudCasa’s native integration with Azure to quickly eliminate the need to manually restore Azure Files PVs.
Furthermore, because CloudCasa is a SaaS application, we were able to deploy the automated snapshot restore capability to production within just two weeks of learning about the issue. The as-a-service model not only takes infrastructure management hassles off your plate; it also helps deliver new features and capabilities like this faster.
If you are running into any other issues with Kubernetes PV snapshots and data recovery, we’d love to hear from you and help. Give CloudCasa a try and sign up for our Free Plan.