Introduction
In the span of about 2.5 years we’ve massively overhauled our infrastructure, leveraging the high availability concepts afforded to us by xcp-ng and TrueNAS. As of this writing our entire networking stack is powered by XCP-ng, including our border routers, which route 2 x 10Gbps IP Transit circuits and 2 x 1Gbps out-of-band IP Transit circuits with BGP to our ISP, with OSPF routing across the rest of the networking fleet. Finally, we’re handing off VRRP floating IP gateways to the hosts where required, and firewalling and performing NAT for physical and virtual infrastructure alike. iPerf testing against public iPerf servers yields 10Gbps wire-speed results to various endpoints on the Internet. All of this runs with only top-of-rack switches in each cabinet to hold the XCP-ng pools together and to provide a physical place to attach the IP transit circuits.
All of this is highly replaceable/reproducible, as we have utilized the Xen Orchestra Terraform Provider with a templated VM image and cloud-init for almost everything. We prepare for disaster recovery by using delta backups with Xen Orchestra to the opposing or tertiary NAS.
Background
We’ve been Xen, later XenServer, and finally xcp-ng faithful since before the fork (https://docs.xcp-ng.org/project/about/).
In the days of our beginnings we had a simple setup: a single hypervisor managing VMs locally. We grew a little, and following that we moved to a NetApp FAS270, and later some NetApp 3070s, serving iSCSI disks to Xen, later xcp-ng. The after-market NetApps served us well. Traditional Cisco Catalyst switches glued it all together, and the entire works was routed by a pair of medium-duty Cisco routers.
We started off as open source consumers, later becoming paying Xen Orchestra customers, and finally paid support customers of Vates.
Time went on, and in late 2022 we realized it was time to upgrade again.
However, we decided to embark on a paradigm shift: with the debut of TrueNAS we could at the very least build our own storage subsystem, and with the advances in top-of-rack switchgear over the years, perhaps quite a bit more.
Why?
With 25+ years in tech we’ve come to know two classifications of hardware arrangements.
You either:
– Deal in warranties
– Deal in spares
Sometimes it’s both, but the decision to do one or the other usually comes down to budget or cost.
However, in both of these arrangements you require replacement from cold spare whether it is a vendor sending you the part or you taking it off the shelf.
In both cases you make best effort to have a hot spare or a quick recovery path in the form of high availability.
We’re small. We deal in spares. But we believe in both careful selection of hardware and maximum robustness with a high availability strategy. We’ve spent time in Fortune 500 organizations with unlimited budget and small startups with no budget.
I. The Hardware Selection
1. The switchgear
After some research we decided that Dell’s S4048-ON made an excellent commodity switch. The reviews were very positive. Redundant hot swappable power supplies. Redundant hot swappable fan assemblies. This was for all intents and purposes a re-skinned Force10 switch with the capability to run whatever OS you wanted on it. We opted to stay with the Dell/Force10 OS 9 (later 10). It is a 48 port 1/10G SFP+ and 6 x 40Gbps QSFP+ switch. We installed two as top-of-rack switches in each cabinet.
It provides a reasonable routing feature set offering ways to configure BGP, OSPF, VRRP, etc. Those last two are important as you will read later.
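To give a flavor of what that looks like, here is a rough OS9-style sketch of an OSPF process plus a VRRP group on a VLAN interface. The addresses, VLAN, and group numbers are placeholders, and the exact syntax varies between OS 9 releases, so treat this as illustrative rather than our running config:

```
! Placeholder addressing; verify syntax against your OS 9 release
router ospf 1
 router-id 10.255.0.1
 network 192.0.2.0/24 area 0
!
interface Vlan 100
 ip address 192.0.2.2/24
 vrrp-group 10
  priority 110
  virtual-address 192.0.2.1
 no shutdown
```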
2. The NAS/Storage Subsystem
A lot of work went into this. A flaw of ours was expecting like-for-like feature parity with the NetApps we were leaving behind (clustering, etc). Our NetApp setup had two unique “filer heads” which were jointly cabled to a set of DS4243 disk shelves. The NetApp wizardry kept it all together, but there was not much in the way of a direct replacement for this. Once we changed our thinking, it cleared the headspace to focus on a new methodology of fault tolerance.
We decided SAS disks were the answer, and since we knew we would be running TrueNAS, with ZFS handling redundancy in software, we did not need any sort of elaborate hardware RAID setup.
The selection process brought us to the Dell PowerEdge R730xd filled with Seagate Exos 18TB 12Gbps SAS disks. We spec’d an NVMe drive on a PCI Express adapter card as an L2ARC cache.
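On TrueNAS the L2ARC is attached through the UI as a cache vdev, but under the hood it amounts to a ZFS command along these lines (pool and device names are placeholders):

```
# Add the NVMe device to the pool "tank" as an L2ARC (cache) vdev.
# Pool and device names are placeholders; TrueNAS normally does this via its UI.
zpool add tank cache /dev/nvme0n1
```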
We were coming from 2 x NetApp 3070s so these were the “01” and “02” counterparts. But as you’ll read further on down the line each one became a ‘home’ SR for a given pool.
Our biggest cost here was the drives, retailing for north of $300 apiece. The hardware, including a next-day parts warranty, was sub-$2000 per unit refurbished from our IT reseller of choice.
3. The Compute (the xcp-ng hypervisors)
We’ve had a longstanding tradition of leaning on IT resellers to purchase last-generation enterprise-class hardware. This brought us to the HP DL360 G9. We loaded each one up with 24 x 32GB DIMMs for 768GB per hypervisor. Each one was fitted with a 2 x 10Gbps NIC in the FlexibleLOM slot and a locally configured RAID for fault tolerance of the OS drive (later also the DR SR; see below).
4. The Routers and Internet
This was still an area where we had not solved the puzzle. The answer would not come to us readily until after several iterations of trial and error. The results are phenomenal, as you will read later.
Until this point we had NAT networks operating from our pair of Cisco ASR1002-X. We did not have a proper firewall, instead leaning on software firewalls on all hosts (iptables, etc) and access-lists on the Cisco for basic packet filtering. We utilized Cisco’s HSRP for redundancy down at the host level giving them a floating default gateway. As we were moving away from Cisco entirely, HSRP was no longer available. VRRP was, however.
At this point we provided service to our customers via a pair of 1Gbps IP Transit Circuits but had budget to upgrade to 10Gbps circuits once the project was near completion.
II. The implementation
The desire was to create a mesh where possible. Utilizing Linux bonding (which both TrueNAS and xcp-ng support natively), we are able to have a NIC installed in each device that connects to both top-of-rack switches.
This would achieve a level of high availability as well as provide us with the ability to achieve a 20Gbps bandwidth path if necessary.
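In xcp-ng terms that means bonding the two 10Gbps PIFs on each host into a single bonded network. A rough sketch of the xe CLI steps, run on the pool master, with the UUIDs as placeholders (we used the default balance-slb mode rather than LACP):

```
# Create the network that will carry the bond, then bond the two 10Gbps PIFs onto it.
xe network-create name-label=bond0
xe bond-create network-uuid=<bond-network-uuid> \
   pif-uuids=<pif-uuid-nic0>,<pif-uuid-nic1> mode=balance-slb
```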

Some poor decisions in migrating our XCP-ng pool to new hardware gave us pause about what we were doing. If you have experience with xcp-ng you will know this as “trying to add pool members that have different NIC configurations”. The results were spectacularly disastrous. Back to the drawing board.
For years we struggled with “how to validate production workloads” when we considered moving up xcp-ng versions. This was no exception.
The decision was to create two equal clusters: (X)cp (P)rimary and (X)cp (S)econdary, or “XP” and “XS” for short. We also designated each of the two new TrueNAS machines as nas-xp and nas-xs respectively. If a resource was “homed” to one of the pools then its VDI also lived on that NAS. Without exception.
The original designation and allocation was:
– All production workload would be in the XP pool
– The secondary pool would exist as a life boat, a staging area, and a place to house secondary workloads that did not need the level of reliability we provide to revenue-generating services.
This strategy allowed us, at any time and for any reason, to migrate the VMs to the secondary cluster for maintenance work on the compute or the NAS/SR backend. With this model we could (and did) upgrade xcp-ng and TrueNAS by live migrating the production workload without downtime. We also ran a quasi ‘active/active’ configuration, as our internal workloads were primaried to the XS cluster, away from the revenue service workloads.
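In practice we drive these moves from Xen Orchestra, but the underlying operation is XAPI’s cross-pool storage motion. From the CLI it looks roughly like this, with every value in angle brackets a placeholder:

```
# Live-migrate a VM from the XP pool to the XS pool, relocating its disks
# to the destination SR (on nas-xs) in the same operation.
xe vm-migrate vm=<vm-name-or-uuid> live=true \
   remote-master=<xs-pool-master-ip> remote-username=root \
   remote-password=<password> destination-sr-uuid=<sr-uuid-on-nas-xs>
```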
III. Integration using Terraform
For many years we were very much a “boot the VM off the ISO and use the installer” shop. We had a sizeable ISO repo and we would continually update it. From Windows, to FreeBSD, to many shades of Linux, bootstrapping a new machine meant booting it off the ISO and running through the install. Over the years we got some things going with Ansible, but there was still a lot left to be desired there. Building new VMs was toil. We did not have a lot of churn so this was not a deal-breaker, but it made for a very inconsistent fleet.
Every time we had to do an OS lift it was a chore. We had very little replaceability beyond what Ansible covered.
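Terraform plus the Xen Orchestra provider and cloud-init is what finally removed that toil. The sketch below follows the community provider’s documented resource names (the provider source has moved namespaces over time), and every name, label, and size is a placeholder rather than our actual manifests:

```
terraform {
  required_providers {
    xenorchestra = {
      # Provider source/namespace depends on the version you use
      source = "vatesfr/xenorchestra"
    }
  }
}

provider "xenorchestra" {
  # Credentials typically come from XOA_URL / XOA_USER / XOA_PASSWORD
  # environment variables; the URL here is a placeholder.
  url = "wss://xoa.example.net"
}

data "xenorchestra_template" "base" {
  name_label = "debian-12-cloudinit-template" # placeholder template name
}

data "xenorchestra_sr" "nas_xp" {
  name_label = "nas-xp" # the pool's "home" SR
}

data "xenorchestra_network" "prod" {
  name_label = "Pool-wide network associated with eth0" # placeholder
}

resource "xenorchestra_cloud_config" "web" {
  name     = "web-cloud-config"
  template = file("${path.module}/cloud-init/web.yaml")
}

resource "xenorchestra_vm" "web" {
  name_label   = "web01"
  template     = data.xenorchestra_template.base.id
  cloud_config = xenorchestra_cloud_config.web.template
  cpus         = 4
  memory_max   = 8 * 1024 * 1024 * 1024 # bytes

  network {
    network_id = data.xenorchestra_network.prod.id
  }

  disk {
    sr_id      = data.xenorchestra_sr.nas_xp.id
    name_label = "web01-root"
    size       = 50 * 1024 * 1024 * 1024 # bytes
  }
}
```

With the template plus cloud-init doing the OS-level configuration, rebuilding a VM becomes a terraform apply rather than an afternoon with an ISO.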
IV. Routers and firewalls in xcp-ng
First, let’s clear the air and stop pointing fingers.
Multicast works in xcp-ng. Full stop. I will explain why a little later on. Do not let anyone in the pfSense, OPNSense, or FreeBSD communities tell you otherwise. If you want a demonstration I can show you with keepalived and OSPF in about ten minutes.
If you are here because you are running xcp-ng and trying to deploy HA with CARP let me stop you right there.
After searching hundreds of websites, talking through things on the xcp-ng Discord, and even talking to Netgate (we took out a pfSense+ license), the finger pointing was explosive.
Here’s the real truth (and you can find it in a discussion between me and another xcp-ng user on Discord):
*If you are running a bond with xcp-ng, CARP does not work*. Not across pools, not across hypervisors, not even on the same hypervisor reliably. CARP unicast is also very, very inconsistent in our own testing.
If the underlying interface on top of which you want to run your VMs and CARP is a bond, *CARP will not work*.
This is not an xcp-ng limitation. Do not file a bug. This is due to the way Linux bonds work and how they present themselves to the hardware they are connected with. Note that our testing was limited only to the Linux bond. We did not do extensive testing with LACP, although we suspect that the issue is with the design of the bond system moreso than the type of bond.
You can use a dedicated PIF that is not bonded for this, and we had reasonable success with that. But having a NIC dedicated specifically to pfSense across the cluster was just not in the cards for us.
NOTE THAT THIS IS TRUE FOR ANY MULTICAST APPLICATION ON A FREEBSD GUEST. This includes OSPF and the multicast CARP advertisements that pfSense’s HA mechanism relies on.
It took us many outages and several months to figure that out.
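What did work for us, over the same bonded PIFs, was VRRP via keepalived on Linux guests, the same demonstration mentioned above. A minimal sketch of the master side, with the interface, VRID, password, and addresses as placeholders:

```
# /etc/keepalived/keepalived.conf on the primary gateway VM.
# The peer VM uses state BACKUP and a lower priority (e.g. 100).
vrrp_instance GW_V4 {
    state MASTER
    interface eth0              # VIF riding on the bonded PIF
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cret        # placeholder
    }
    virtual_ipaddress {
        192.0.2.1/24            # floating default gateway handed to the hosts
    }
}
```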
V. Disaster Recovery
We came from NetApp. With their fancy dual parity