Matt D's Blog

Introduction

In updating the website infrastructure to be more modern, I figured it would be simple overall - and it was. Put each item into a container, and then run it under an orchestrator (Nomad) for…reasons. But, as with all good plans, the devil was in the details. All was fine when it was one node. Then I thought, well, why not make it a 3-node cluster, providing some resiliency and allowing the underlying orchestration and service discovery components to work in a cluster like they are designed to. So that works great…except how do I make sure the data I’m referring to (like this website) is the same on all of the nodes?

Contenders

The major contenders out there for shared container storage are:

CEPH
Portworx
iSCSI + automation (many vendors)
Cloud specific volume drivers (AWS / Azure / GCP volumes)

There are others, such as Rancher’s Longhorn, but while that is well integrated into Kubernetes, support for plain Docker volumes does not seem to be there. As I’m running my containers on Nomad, on virtual machines that are not in the cloud, the cloud-specific volume drivers are out. Automated provisioning of iSCSI volumes is a good option if your vendor has support for Kubernetes or at least Docker.

I also didn’t look into OpenEBS, but that may have been a mistake. We’ll see.

CEPH

CEPH makes the most sense here, and is what I wanted to go with, but alas, that was not in the cards.

Previously, the back-end to CEPH had a FileStore option as well as BlueStore, which is now the default. BlueStore works great as long as you have enough RAM. But the nodes I am working with only have 2 GB of RAM, which, as the requirements warn you ahead of time, is not enough. The main difference between FileStore and BlueStore is that FileStore is built on top of standard filesystems like ext4 and xfs, whereas BlueStore removes that layer from the mix. The downside is that BlueStore also has its own cache - which wants 2 GB of RAM by default. 2 GB - 2 GB…is zero for anything else. So you can see how that might be an issue. In theory you can change the cache size setting, but in the end, the OSD (the storage service) kept crashing, and it was just not a good experience.
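To put numbers on that, here is a minimal sketch of the memory budget. The 2 GB node size and 2 GB cache are the figures from above (treated as GiB here); the smaller cache sizes are hypothetical values I picked for comparison, not settings I tested.

```python
# Sketch of the memory budget described above.
# The 2 GiB figures come from the post; the smaller cache sizes are
# hypothetical values for comparison only.
GIB = 1024 ** 3


def headroom(total_ram: int, cache_size: int) -> int:
    """Bytes left for the OS and everything else once the cache is full."""
    return max(total_ram - cache_size, 0)


node_ram = 2 * GIB
for cache in (2 * GIB, 1 * GIB, GIB // 2):
    left = headroom(node_ram, cache)
    print(f"cache {cache / GIB:.1f} GiB -> {left / GIB:.1f} GiB left")
```

With the default-sized cache there is nothing left over, which lines up with the crashes. A smaller cache leaves some room on paper, but that still has to cover the OS and everything else on the box.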

Pros:

Using cephadm, it was pretty simple to get going
Full stack setup including monitoring / web control out of the box

Cons:

Memory usage
cephadm was very opaque about what it was going to do

If I had beefier machines to work with, this is likely where I would have stopped, as it supports both shared file systems and block storage.

Portworx

I stumbled upon Portworx via the HashiCorp Learn site when looking at all of the options available for running workloads that require state. The installation was relatively easy, but I missed two very important pieces:

The cluster ID in the command needed to be replaced with the cluster ID from the Portworx licensing site.
There’s no easy way to actually load a Portworx Essentials license on a non-Kubernetes host.

As a result, nothing else I could write here would matter, as the cost to license this software is astronomical - $0.55 per machine-hour. My instances are running 24x7, which is ~720 hours/month. 3 machines. Yeah, no. So this software could be the best…and it’s OK, but it’s repackaged btrfs + networking magic + an NFS server. It had better be making me breakfast for the pricing they want. Next.
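For the record, the math works out as follows. This is a quick sketch using only the numbers above; the function name is made up for illustration.

```python
# Monthly licensing cost from the figures quoted above.
RATE_PER_MACHINE_HOUR = 0.55   # USD, as quoted
HOURS_PER_MONTH = 720          # 24 hours * 30 days
MACHINES = 3


def monthly_cost(rate: float, hours: int, machines: int) -> float:
    """Total monthly cost in USD for machines that run around the clock."""
    return rate * hours * machines


cost = monthly_cost(RATE_PER_MACHINE_HOUR, HOURS_PER_MONTH, MACHINES)
print(f"${cost:,.2f} per month")
```

That comes to $1,188.00 per month, or $396 per machine, to host a blog.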

LINSTOR

I had come across LINSTOR a few times because of my previous work with another LINBIT product, DRBD, which I’ve used at work with a great deal of success to keep nodes in sync and to migrate virtual machines between hosts without shared storage.

LINSTOR basically takes their DRBD product and adds automation on top of it to work with anything - it provides a REST API, a reasonable CLI, and integration through both a Docker volume driver and a CSI driver for Kubernetes. It’s got a pretty simple architecture: a controller which tells satellite boxes what to do. What’s nice is you don’t have to have storage on every node - you can mount any of the resources remotely, purely over the network.

I ran through the installation, cleaned off the data drive, and added it to the pool. Everything looked great - until I tried to create a volume with Docker. It wouldn’t work. But this was again a self-inflicted wound, the result of my not paying attention to the warning that the node name I gave it didn’t match the host name. This was because I was using WireGuard tunnels between the nodes, and had used the tunnel hostnames as the node references. The correct way to do this was to add the node with the actual node hostname, and pass the desired IP address as an additional parameter.
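A rough sketch of that fix, to make the ordering concrete. It assumes the `linstor node create <node_name> <ip>` form of the CLI; the node name `node1`, the tunnel name `wg-node1`, and the address `10.0.0.1` are hypothetical placeholders, not my real values. The hostname comparison is my approximation of what the warning checks, not LINSTOR’s actual code.

```python
# Sketch: register a node under its real hostname while pointing LINSTOR
# at a different (e.g. WireGuard) address. All names and addresses here
# are hypothetical placeholders.
import socket
from typing import Optional


def name_matches_host(node_name: str, hostname: Optional[str] = None) -> bool:
    """Approximate the mismatch warning: node name vs. the machine's hostname."""
    if hostname is None:
        hostname = socket.gethostname()
    return node_name == hostname


def node_create_command(node_name: str, ip_address: str) -> list:
    """Build the argv for `linstor node create <node_name> <ip>`."""
    return ["linstor", "node", "create", node_name, ip_address]


# What I did first: used the tunnel name as the node name (mismatch).
print(name_matches_host("wg-node1", hostname="node1"))
# What works: the real hostname, plus the tunnel IP as the extra argument.
print(name_matches_host("node1", hostname="node1"))
print(" ".join(node_create_command("node1", "10.0.0.1")))
```

The point is that the node name and the address are separate things: the name identifies the machine, and the IP decides which interface the traffic uses.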

Once that was done, we were in business. I moved the Commento DB to a volume, which worked great. I then tried to move a MongoDB volume and kept getting strange errors such as:

Error message:        Invalid name: Name length 64 is greater than maximum length 48

Error context:
    The specified resource name '9f54e6561fe90a90b34331f5b988a9a2f42f5ccbf2ef8d99e65f0e58b74a6bd4' is invalid.

Which… I have no idea what caused that. I did, however, find a solution. I tested the same volume using a different container image (ubuntu:latest) and it mounted and came up just fine. So it was something unique to the container image. I brought up a new empty container, exported and re-imported it so that it would flatten, and it works just fine with the flattened image. No idea why, but not my problem right now. :) I fought with that for 2 days, so I hope to not have to revisit that any time soon.
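For reference, the check behind that error is easy to reproduce. The sketch below is not LINSTOR’s code: the 48-character limit and the message wording are copied from the error above, and the function names and the `commento-db` name are made up for illustration. One unverified guess: a 64-character hex string is the shape of an ID that Docker generates itself (for example for a volume that was never given a name), so the name may not have been one I chose at all.

```python
# Sketch of the length check implied by the error message above.
# Not LINSTOR's implementation; the limit is taken from the error text.
import string

MAX_NAME_LENGTH = 48


def validate_resource_name(name: str) -> str:
    """Return the name unchanged, or raise if it exceeds the limit."""
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(
            f"Invalid name: Name length {len(name)} is greater than "
            f"maximum length {MAX_NAME_LENGTH}"
        )
    return name


def looks_like_generated_id(name: str) -> bool:
    """True for a 64-character hex string, the shape of an auto-generated ID."""
    return len(name) == 64 and all(c in string.hexdigits for c in name)


# The resource name from the error context above.
name = "9f54e6561fe90a90b34331f5b988a9a2f42f5ccbf2ef8d99e65f0e58b74a6bd4"
print(looks_like_generated_id(name))
try:
    validate_resource_name(name)
except ValueError as err:
    print(err)

# A short, human-chosen name (hypothetical) passes the same check.
print(validate_resource_name("commento-db"))
```

If that guess is right, it would also fit with the flattened image behaving differently, but I have not confirmed it.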

The only other problem is that if the controller is rebooted, the satellites need to be restarted for it to pick up on their state for some reason. This might be because of my WireGuard settings, and I will be adding keepalive there and seeing if that makes this go away.

Additionally, shared (RWX) storage is left to the user to implement via a gateway. For now I may just manually sync the nodes for this part. One problem at a time. :)

So for now, I’m mostly happy with LINSTOR. I appreciate that it’s fully open source and built on a foundation that’s been around a long time, and works.