KubeDR – Disaster Recovery for Kubernetes Clusters

Kubernetes has taken the container world by storm and is fast becoming the de facto standard for deploying applications both on-prem and in the cloud. Kubernetes orchestrates containers across a cluster of hosts. To manage the cluster and all the resources in it (such as Pods and Services), Kubernetes provides REST APIs, through which one can perform CRUD operations on any supported resource. Kubernetes stores all cluster data, including the resource specs, in etcd, a distributed key-value store. Because etcd is such a key component of the Kubernetes architecture, it is very important to protect the data stored there. That brings us to the main topic of this blog post.
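To make "all the cluster data lives in etcd" concrete: the API server keeps each object under its own key, and every key shares a common prefix. The Python sketch below assumes the default /registry prefix and the layout used for built-in resources such as Pods, ConfigMaps, and Namespaces. Some resources (for example Services and custom resources) use slightly different paths, and etcd_key is a hypothetical helper for illustration, not part of KubeDR or Kubernetes.

```python
# Hypothetical helper: approximates the etcd key under which the API server
# stores a built-in resource, assuming the default "/registry" prefix.
# Some resources (e.g. Services, CRDs) use other layouts; illustrative only.
from typing import Optional

DEFAULT_PREFIX = "/registry"


def etcd_key(resource: str, name: str, namespace: Optional[str] = None,
             prefix: str = DEFAULT_PREFIX) -> str:
    """Build '<prefix>/<resource>/<namespace>/<name>' for namespaced
    resources, or '<prefix>/<resource>/<name>' for cluster-scoped ones."""
    parts = [prefix.rstrip("/"), resource]
    if namespace is not None:
        parts.append(namespace)
    parts.append(name)
    return "/".join(parts)


if __name__ == "__main__":
    print(etcd_key("pods", "nginx", namespace="default"))  # /registry/pods/default/nginx
    print(etcd_key("namespaces", "kube-system"))            # /registry/namespaces/kube-system
```

Because every object sits under the same prefix, a snapshot of etcd naturally includes all of them, whatever labels they carry.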

There are many products that back up applications running on Kubernetes, or even any resource in the cluster, including persistent volumes. To protect resources, you need a backup of the resource specs (in YAML or JSON format). These backup products usually require you to set a well-defined label so that they can tell which resources you want backed up. The problem with this approach is that the backup doesn’t include *all* resources unless you took care to place the required label on every resource (or namespace). As a result, if you need to rebuild the cluster, you may be missing some key resources. The only way to capture *all* resources is to back up the etcd data directly.
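The gap described above is easy to see with a toy model. The Python sketch below uses made-up resources and a hypothetical backup=true label (neither comes from any particular product) to show which objects a label-driven backup would silently skip.

```python
# Illustrative comparison with hypothetical data and a hypothetical label key:
# a label-driven backup only sees labeled resources; anything unlabeled is skipped.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    kind: str
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


def select_by_label(resources: List[Resource], key: str, value: str) -> List[Resource]:
    """Mimic a label-driven backup: keep only resources carrying key=value."""
    return [r for r in resources if r.labels.get(key) == value]


def missed_by_label_backup(resources: List[Resource], key: str, value: str) -> List[Resource]:
    """Return the resources a label-driven backup would leave out."""
    return [r for r in resources if r.labels.get(key) != value]


cluster = [
    Resource("Deployment", "web", {"backup": "true"}),
    Resource("Service", "web", {"backup": "true"}),
    Resource("ConfigMap", "web-config"),       # label forgotten
    Resource("Secret", "db-credentials"),      # label forgotten
]

if __name__ == "__main__":
    skipped = missed_by_label_backup(cluster, "backup", "true")
    print([f"{r.kind}/{r.name}" for r in skipped])
    # ['ConfigMap/web-config', 'Secret/db-credentials']
```

Restoring from such a backup would bring back the Deployment and Service but not the configuration and credentials they depend on, whereas an etcd snapshot contains all four objects.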

Now, there are many articles and blog posts that describe how to take an etcd snapshot in a Kubernetes environment. They usually provide a Docker command that creates an etcd snapshot and may even list a sequence of manual steps for saving the snapshots locally as backups. But manual steps invite mistakes and neglect. Moreover, a backup solution must offer more than taking a snapshot and saving the file. It should support features such as retention (keeping the last N backups and cleaning up older ones) and pausing/resuming.
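For reference, the snapshot step itself usually boils down to a single etcdctl snapshot save call. The Python sketch below wraps it so the file gets a timestamped name. It runs etcdctl directly rather than through Docker, and the endpoint and certificate paths are assumptions (kubeadm-style defaults), so adjust them for your cluster. This is not how KubeDR itself takes snapshots.

```python
# Sketch that automates the manual "take a snapshot" step with etcdctl.
# Endpoint and certificate paths are assumptions (kubeadm-style defaults).
import os
import subprocess
from datetime import datetime, timezone
from typing import List, Optional


def snapshot_command(out_dir: str,
                     endpoint: str = "https://127.0.0.1:2379",
                     pki_dir: str = "/etc/kubernetes/pki/etcd",
                     now: Optional[datetime] = None) -> List[str]:
    """Return an etcdctl argv that writes a timestamped snapshot file."""
    now = now or datetime.now(timezone.utc)
    snapshot_file = os.path.join(out_dir, now.strftime("etcd-snapshot-%Y%m%dT%H%M%SZ.db"))
    return [
        "etcdctl", "snapshot", "save", snapshot_file,
        f"--endpoints={endpoint}",
        f"--cacert={os.path.join(pki_dir, 'ca.crt')}",
        f"--cert={os.path.join(pki_dir, 'server.crt')}",
        f"--key={os.path.join(pki_dir, 'server.key')}",
    ]


def take_snapshot(out_dir: str) -> str:
    """Run etcdctl (must be on PATH) and return the snapshot file path."""
    argv = snapshot_command(out_dir)
    env = dict(os.environ, ETCDCTL_API="3")  # v3 API; already the default in etcdctl 3.4+
    subprocess.run(argv, check=True, env=env)
    return argv[3]
```

Even scripted like this, the result is only a file on local disk. Shipping it off the node, pruning old copies, and pausing the schedule are still left to you, which is exactly the gap described above.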

Keeping all these things in mind, we at Catalogic Software decided to implement this kind of backup for the etcd data of a Kubernetes cluster as part of our cLabs initiative. We want to mark our entry into the Kubernetes world by open-sourcing the solution and giving back to the Kubernetes community from which we often benefit.

KubeDR allows you to back up not only etcd data but certificates as well. The combination of etcd data and certificates allows you to rebuild the cluster. Here are some of the high-level features supported by KubeDR:
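Backing up certificates can be as simple as archiving the directory that holds them. The Python sketch below bundles such a directory into a gzip-compressed tarball. It assumes the kubeadm default location /etc/kubernetes/pki, and backup_certificates is a hypothetical helper, not KubeDR's implementation.

```python
# Minimal sketch: archive a cluster's certificate directory so it can be stored
# next to an etcd snapshot. The kubeadm default "/etc/kubernetes/pki" is an assumption.
import os
import tarfile
from typing import List


def backup_certificates(pki_dir: str, archive_path: str) -> List[str]:
    """Write pki_dir into a .tar.gz archive and return the archived file names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps member paths relative (e.g. "pki/ca.crt") so a restore
        # can extract them under any root directory.
        tar.add(pki_dir, arcname=os.path.basename(os.path.normpath(pki_dir)))
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

For example, backup_certificates("/etc/kubernetes/pki", "/var/backups/certs.tar.gz") would capture the CA and component certificates along with the etcd subdirectory. Keep in mind that the archive contains private keys, so it should be stored as carefully as the etcd snapshot itself.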

Back up cluster data in etcd to any S3-compatible storage.
Back up certificates.
Pause and resume backups.
Clean up older snapshots based on a retention setting.
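To illustrate how the last two features in the list fit together, here is a hypothetical Python sketch. BackupPolicy and its methods are names invented for this example, not KubeDR's API. A scheduled run is skipped while paused, and after each new snapshot the oldest ones beyond the retention limit are deleted.

```python
# Hypothetical sketch (not KubeDR's API): retention of the newest N snapshots
# plus a pause flag that makes scheduled runs a no-op.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BackupPolicy:
    retain: int                                          # snapshots to keep
    paused: bool = False
    snapshots: List[str] = field(default_factory=list)  # oldest first

    def __post_init__(self) -> None:
        if self.retain < 1:
            raise ValueError("retain must be at least 1")

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def run(self, take_snapshot: Callable[[], str],
            delete: Callable[[str], None]) -> bool:
        """Take a snapshot unless paused, then prune the oldest snapshots
        beyond the retention limit. Returns True if a snapshot was taken."""
        if self.paused:
            return False
        self.snapshots.append(take_snapshot())
        while len(self.snapshots) > self.retain:
            delete(self.snapshots.pop(0))  # remove the oldest first
        return True
```

With retain=2, three runs leave the two newest snapshots and delete the first one. A run while paused changes nothing, and after resume() the next run continues pruning as before. In a real deployment, take_snapshot and delete would talk to etcd and to the S3-compatible bucket.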

For more details about the solution, please visit KubeDR@github. Please note that the project is in the alpha phase, so we welcome reports of any corner cases we didn’t consider, as well as feature suggestions, on GitHub. We are actively working on adding monitoring and providing more restore options. In its current form, the solution works on any cluster where you have access to etcd. This includes on-prem clusters as well as cloud clusters that are explicitly set up on compute instances.

We look forward to working with the Kubernetes community to improve KubeDR.
