When I first heard of K8s a few years back, storing persistent data on top of it was a forbidden topic, something you shouldn’t really do. Things have evolved since then: the technology improved, but mostly the need for persistent storage grew as we migrated more and more apps to K8s.

Storing persistent data in applications can be achieved in a myriad of ways, and with the advent of microservices, the emphasis has been on moving application state outside of the application itself (and its local filesystem directory) and into some external service such as:

  • external NFS share
  • object storage bucket
  • database (SQL, in-memory, whatever.)

As a side-effect of moving data outside of the application, we can deploy it by simply deleting it and replacing it with a new build/revision. Also, any configuration we might want to change can be specified by setting certain environment variables for the process. As a Mandalorian would say: This is the way.

That all sounds nice and clean, but things aren’t always that way. There are times when you want to keep some application data or state, and then the challenge becomes how to move it outside of the ephemeral container, or how to make that container less ephemeral. For example, you might want persistence when:

  • you decide to run a database server in containers on top of Kubernetes. You have to store the dataset somewhere.
  • you have some legacy application that requires local storage for uploads or for keeping some other data (or performs better in such a scenario)

I remembered those things in particular as I’ve been thinking about hosting Nextcloud on top of K8s, but there are likely other reasons as well. Setting up Nextcloud will be a topic for some future article.

In any case, Kubernetes, with its huge API, has a solution for all kinds of scenarios in which you’d want to persist some data. The components involved in the usual data life-cycle management are:

  • StorageClass - a cluster-wide (global) resource that uses the Storage API within K8s and is used for configuring volumes by communicating with its provisioner (or driver, if you wish). It manages the parameters of your volumes and decides what happens with the volume data once you delete it (reclaimPolicy).
  • PersistentVolume - a volume created by/in some StorageClass. There you define which StorageClass to use and the capacity of the volume, but you can also specify access modes and the reclaim policy. Depending on the StorageClass, you might have the option to specify additional parameters such as the number of volume replicas and whatever else makes sense to the underlying driver.
  • PersistentVolumeClaim - you specify it in order to bind a certain volume and expose it within a Namespace for some Pod to use.
  • volumeMount - the option within the StatefulSet, Deployment, Pod, or whatever, that maps a PersistentVolumeClaim (through a volume entry) to a certain path within the container.

These things are a bit more complex than what I described, of course. I included a minimal explanation just to get you through this article. If you have more experience and a better way to explain these components, don’t hesitate to share 🙂
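To make those relationships a bit more concrete, here is a rough sketch of a StorageClass definition. The provisioner name and parameters are made-up placeholders; every driver brings its own set:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: my-storage-class                  # referenced by PVCs via storageClassName
provisioner: example.vendor/csi-driver    # hypothetical CSI driver that actually creates the volumes
reclaimPolicy: Retain                     # keep the underlying data when the volume is released
allowVolumeExpansion: true
parameters:
  # driver-specific knobs go here (replica count, filesystem type, ...)
  exampleParameter: "some-value"

A PersistentVolumeClaim then points at this class by name, and a volumeMount in the workload maps the bound volume to a path inside the container, as shown later in this article.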

Volume driver choice

There are many CSI plugins available for provisioning volumes, and in my humble opinion it would appear that the major cloud providers figured it out first. They simply provide a way to create a volume on their storage platform (be it EBS on AWS, Azure Storage, or something else) and map it to the appropriate compute runtime (Fargate container, EC2 instance, etc.). They already have tried-and-true methods for managing volumes and filesystems.

But I run locally, on bare metal, so I had to come up with a different solution for my use-case, and I imagine many folks that decide to take the same route (on-premises K8s) will need to make a similar decision.

When thinking about which driver to use, there were multiple candidates. Some of them were too complex for me to start with without first properly understanding most components of the storage system, some seemed like overkill, and some appeared to be a ’non-redundant’ option.

Although I don’t have strict requirements for redundancy at this point (since the whole setup is STILL RUNNING ON A SINGLE MACHINE), it would be nice to lay the groundwork for adding new machines and have automatic redundancy from that point on, without much hassle or reconfiguration (or, God forbid, redeploying the whole thing and recreating the many volumes I might “inject into the cluster” until then).

I’ve been getting various recommendations, but in the end I decided to start the journey with Longhorn. Other options that I plan on trying down the road, in no particular order, are:

  • External NFS mount - simple, yet effective. For scenarios where I might want to store huge files on some external drive (e.g. a Nextcloud volume for keeping my “Archives” folder)
  • Synology CSI - since I have a Synology server now (thanks Andrei) I might want to play a bit with their own CSI driver and see how it works and what can be achieved with it
  • OpenEBS with ZFS - was a huge contender to Longhorn when I was looking for an initial storage solution, but it seemed too complex to start with.
  • DirectPV by Minio - similar to OpenEBS with ZFS, and thus seemed like too much work to start with; at first glance it also didn’t seem to have replication options built in, but it could be that I haven’t read the documentation carefully.
  • Rook (Ceph) - was definitely overkill to manage at this point. It is overkill on a bare-metal cluster with 20 servers, so naturally it is on one non-redundant machine as well. At this point I would probably spend more time faking redundancy, configuring networking, etc. than getting something running.
  • Minio - I might try this one next, but to start with I needed block/filesystem storage, not object storage, because I want to run a database on top of the cluster.

Configuring Longhorn

The whole process was pretty straightforward, mostly following their quick-start guide.

Installing prerequisites

There are some prerequisites I had to install on the host machine in order to enable Longhorn provisioning:

apt update
apt install jq open-iscsi nfs-common
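If I remember correctly, Longhorn also ships an environment check script that verifies these prerequisites (and a few kernel/config requirements) on each node; adjust the version in the URL to the release you’re installing:

curl -sSfL https://raw.githubusercontent.com/longhorn/longhorn/v1.5.1/scripts/environment_check.sh | bash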

Install Longhorn

To install Longhorn, as I didn’t need any special configuration (I’m just playing around with it), I went with the plain Quick Start defaults:

helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace --version 1.5.1
kubectl -n longhorn-system get pod
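Instead of re-running get pod until everything settles, something like this should block until all pods in the namespace report Ready (or the timeout expires):

kubectl -n longhorn-system wait --for=condition=Ready pods --all --timeout=300s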

Ingress configuration

Since Longhorn has a UI for managing things, I had to expose it via the Ingress Controller.

Prior to configuring Ingress, I created a new secret in order to use those credentials for Basic Auth:

kubectl -n longhorn-system create secret generic longhorn-basic-auth --from-literal=MYSUPERUSERNAME=$(echo 'MYSUPERPASSWORD' | openssl passwd -stdin)

Since I’m using HAProxy as the Ingress Controller, I had to configure it in the following manner:

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    haproxy.org/auth-realm: "Longhorn Admin"
    haproxy.org/auth-type: basic-auth
    haproxy.org/auth-secret: longhorn-basic-auth
  name: ingress-longhorn
  namespace: longhorn-system
spec:
  ingressClassName: haproxy
  rules:
  - host: mylonghornsubdomain.MYDOMAIN
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: longhorn-frontend
            port:
              number: 80

As you can see, to prevent unauthorized access to the Longhorn management UI, I enabled Basic Auth, which is good enough for this purpose.
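After applying the manifest, a quick smoke test is to hit the frontend with and without credentials (the hostname, filename, and credentials below are just the placeholders from above):

kubectl apply -f ingress-longhorn.yaml
curl -I http://mylonghornsubdomain.MYDOMAIN/                                      # expect 401 without credentials
curl -I -u MYSUPERUSERNAME:MYSUPERPASSWORD http://mylonghornsubdomain.MYDOMAIN/   # expect 200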

And that’s about it. As you can see, we now have a working StorageClass which we can use to provision our volumes.
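A quick way to confirm that is to list the storage classes; the longhorn class should show up with the (default) marker and driver.longhorn.io as its provisioner:

kubectl get storageclass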

For example, I might use the following volumeClaimTemplate in a StatefulSet in order to get a volume automatically provisioned:

  volumeClaimTemplates:
    - metadata:
        name: ednevnik-data
      spec:
        accessModes: ["ReadWriteOnce"]
        volumeMode: Filesystem
        resources:
          requests:
            storage: 512Mi
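The claim from that template still needs to be referenced from the container spec of the same StatefulSet. A minimal sketch of that part (the container name, image, and mount path are made up):

      containers:
        - name: ednevnik                              # hypothetical container
          image: registry.example.com/ednevnik:latest
          volumeMounts:
            - name: ednevnik-data                     # must match the volumeClaimTemplate name
              mountPath: /var/lib/ednevnik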

Or I might use a PersistentVolumeClaim resource to provision the volume:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dns-zone-current-ip-volume
  namespace: misc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Mi

This will instruct the default StorageClass, Longhorn, to provision a new volume and bind it to this PVC, which is in turn attached/mounted within the pod/container if we define it so.
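For completeness, the workload-side wiring for that PVC looks roughly like this (the container name, image, and mount path are made-up placeholders):

      volumes:
        - name: dns-zone-data
          persistentVolumeClaim:
            claimName: dns-zone-current-ip-volume     # the PVC defined above
      containers:
        - name: current-ip-updater                    # hypothetical container
          image: registry.example.com/current-ip-updater:latest
          volumeMounts:
            - name: dns-zone-data
              mountPath: /data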

Configuring backups

Besides replication options, Longhorn has the ability to configure backups of the volumes. You have a choice of using CIFS, NFS, or S3-compatible storage as a backup destination. For simplicity’s sake (and because it is cheap) I created a new S3 bucket on my AWS account and configured S3-based backups.

There is probably a way to do it via various YAML definitions and resources, but I used the UI instead. When you go into Settings in the UI, there’s an option to configure the S3 endpoint where you’d like to send your volume backups. It has to be configured in the following format: s3://BUCKETNAME@REGION/. Immediately below it, there’s an option to specify the name of the secret used for authenticating to the endpoint.
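For the record, these settings appear to be exposed as Helm values on the chart as well, so they could probably be set declaratively at install/upgrade time instead of clicking through the UI; double-check the exact value names against the chart version you use:

# values.yaml for the longhorn chart
defaultSettings:
  backupTarget: s3://BUCKETNAME@REGION/
  backupTargetCredentialSecret: longhorn-s3-backup-secret   # the secret defined just below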

For S3 you might want to use something like the following:

apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: longhorn-s3-backup-secret
  namespace: longhorn-system
stringData:                      # stringData takes plain values; with data: they must be base64-encoded
  AWS_ACCESS_KEY_ID: MYKEY
  AWS_SECRET_ACCESS_KEY: MYSECRET
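Alternatively, the same secret can be created imperatively, which spares you from thinking about the encoding at all:

kubectl -n longhorn-system create secret generic longhorn-s3-backup-secret \
  --from-literal=AWS_ACCESS_KEY_ID=MYKEY \
  --from-literal=AWS_SECRET_ACCESS_KEY=MYSECRET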

If you’re sure everything is configured properly but for some reason it is not working (the Backup page throws an error saying that auth has failed, without a proper explanation), you might want to try a rollout restart of the longhorn-ui deployment before spending hours troubleshooting the issue any further:

kubectl -n longhorn-system rollout restart deployment longhorn-ui

This did the trick for me (after I looked at, and double-checked, various things 30 times 😏).

Unknowns and issues

The node running this setup doesn’t have too big of a drive (120 GB NVMe), so the space can fill up quickly, and with its conservative limits and space reservations, Longhorn can quickly make the node unschedulable, thus blocking further volume provisioning. It can even block writes to the volumes, breaking things here and there. This is what you get when you go into the storage business before properly understanding things to the last detail. But oh well, we live and learn. You might be smarter than me and adjust those limits on the node before any issues occur. My recommendation is to set them slightly higher than your usual monitoring threshold (if I alert on disk space usage at 90%, I don’t want Longhorn to break things and cause issues when the disk is at 70%).
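For reference, the knobs in question live under Settings in the UI as Storage Minimum Available Percentage and Storage Over Provisioning Percentage, and the Helm chart seems to expose them as default settings too; the value names below are my best reading of the chart, so verify them before relying on this:

# values.yaml for the longhorn chart
defaultSettings:
  storageMinimalAvailablePercentage: 10    # node/disk becomes unschedulable below this much free space
  storageOverProvisioningPercentage: 100   # cap on total provisioned volume size vs. actual capacity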

Then there are some unknowns I have in the volume restore process. For example, I was unable to perform a restore if the volume still existed; instead, I had to remove it first to be able to restore it under its original name. The alternative would probably be to restore it under a different name and adjust the mapping within the PVC and the workload. Performing basic tasks like this feels tedious and weird to me, but that is probably because I’m not closely familiar with Longhorn internals.

Speaking of volume backups and restores, after restoring the volume I have trouble setting up recurring backups on that volume again. Automated backups used to work just fine, but now they’re failing with:

time="2023-10-30T00:32:32Z" level=error msg="Failed to run job for volume" concurrent=3 error="failed to complete backupAndCleanup for pvc-01de9f5b-b319-4676-bd3b-d7a3f23d99b8: timeouted waiting for the the snapshot daily-ba-c1c35ea7-635e-40bd-87cc-1c69d840f0f6 of volume pvc-01de9f5b-b319-4676-bd3b-d7a3f23d99b8 to be ready" groups=default job=daily-backup labels="{\"RecurringJob\":\"daily-backup\"}" retain=10 task=backup volume=pvc-01de9f5b-b319-4676-bd3b-d7a3f23d99b8

What’s even weirder is that a manual snapshot of the volume works just fine. I’m trying other things on K8s in parallel as I’m troubleshooting this, so it doesn’t have the highest priority it probably should have, but once I get to the bottom of it I’ll make sure to publish an update or a new article on that particular topic. It is probably just some silly thing I forgot to do. If in the meantime you’ve seen exactly this behavior and managed to resolve it, feel free to share your ideas.

Conclusion

Aside from the self-inflicted quirks and my lack of understanding of the system, Longhorn is a pretty nice solution for persistent storage on K8s. It can handle volume replicas for you and thus ensure redundancy. One can think of it as a simplified Ceph. I’m still waiting to experience some real issues with it, and perhaps my opinion of the system will change, but until then, it’s passable.

Performance is something I haven’t tested, and I can bet that it isn’t great compared to storing directly on the disk. But there is a v2 data engine, currently in preview, which will allow managing block devices directly instead of mounting filesystem directories; I’ll probably play around with that one more once it reaches GA.

Next steps

My next step now is to resolve the Longhorn issues I have, and then focus on provisioning a database on top of it. For that purpose I narrowed my choice down to the Percona Operator for PostgreSQL, because, well, the Percona folks are awesome. In the meantime, I’m also reading a bunch of various docs, listening to other folks’ opinions, and evaluating options in my head.

One such recent discovery was the fact that Ingress will soon be superseded by the Gateway API, so I might play with that on the HAProxy Ingress Controller a bit as well.