Xen + AoE + drbd = New Redundant Hotness

A few weeks ago, my buddy Justin posed an interesting problem, one that I’ve been pondering myself for some time. He’s somewhat of a Xen zealot like myself, and is doing some Xen setups that are similar in construction to mine, with a central shared storage array and two or more dom0 machines where the child instances will live. The prospect of migrating domUs between dom0s is quite appealing to him, but he, like me, realized a critical flaw in the setup. If the storage array fails or requires uptime-affecting maintenance of some sort, the whole setup grinds to a halt. That doesn’t really fit the goals he and I are both after.

After a bit of thought, I looked to a project Justin had mentioned a few months back called drbd. Its designers deem it “network raid 1”, and that’s a pretty accurate description. It’s essentially a system that mirrors data between two different machines in either an active-standby or active-active configuration. One of its primary goals is to provide storage as close to 100% of the time as is possible. Its usefulness would vary widely depending on the application. Having a normal file system on it shared between two machines could be nothing short of a nightmare, since neither server knows what the other is doing until changes are already written to the shared storage. A clustered file system would work well with it, though. As I began to learn more about how it works, I realized it could potentially be a great solution for my predicament. Since either of the two machines could provide storage at any given time, it would have no problem fulfilling the near-100% uptime requirement.
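As a sketch of what this looks like in practice, here’s a minimal drbd resource definition for a two-node mirror. The hostnames, devices, and addresses below are placeholders rather than my actual config, and the exact syntax varies a bit between drbd versions:

```
# /etc/drbd.conf -- hypothetical resource mirroring one LVM volume
resource r0 {
  protocol C;   # a write completes only once it's on disk on both nodes
  on node-a {
    device    /dev/drbd0;
    disk      /dev/vg0/xen-disk0;          # backing LVM partition
    address   192.168.0.1:7788;
    meta-disk /dev/vg0/xen-disk0-meta[0];  # separate LV for drbd metadata
  }
  on node-b {
    device    /dev/drbd0;
    disk      /dev/vg0/xen-disk0;
    address   192.168.0.2:7788;
    meta-disk /dev/vg0/xen-disk0-meta[0];
  }
}
```

Protocol C is the “write is finished only when both ends have it on disk” mode mentioned below, which is what makes the two copies safe to treat as interchangeable.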

What really makes the solution stand out to me isn’t just drbd itself, but the combination of drbd and AoE. AoE is, by design, a connectionless protocol. When the kernel module is loaded, it does a device discovery to see what devices are available for its use, and listens thereafter for new devices. The information it learns is pretty much limited to a MAC address where the storage device is located and the vblade “addresses” within that device that are available. There’s nothing within the protocol that outlaws multiple targets from advertising the same vblade “address”, and it’s up to the AoE initiator in the kernel module to choose where it’s sending data. Because of this, you could have two Linux vblade targets running on both “ends” of a drbd setup, and there’d be no conflicts whatsoever. The recommended setup in drbd is to consider a write operation as finished only when data has been written to disk on both “ends” of the drbd setup. Combine that with the fact that AoE will only send commands to one MAC address at a time, and it’s pretty much guaranteed that both vblade targets will be connected to the same data at all times, even though they’re on different machines. I can think of a scenario or two where data would be out of sync, but it would require that disk write operations be done in parallel, and I’m fairly certain that they aren’t.
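To make that concrete, here’s a hedged sketch of the two-target setup. The shelf/slot numbers, interface name, and device path are all assumptions on my part:

```shell
# Run on BOTH drbd nodes: export the drbd device as AoE shelf 0, slot 1.
# Nothing in the protocol stops two machines from advertising the same
# address; the initiator just picks one MAC and talks to it.
vbladed 0 1 eth0 /dev/drbd0

# On a Xen dom0 (initiator side), load the aoe module; the device shows
# up as /dev/etherd/e0.1 once it's been discovered.
modprobe aoe
aoe-discover   # from the aoetools package; forces a re-scan
```

Since both targets sit on top of the same synchronously-mirrored drbd device, it doesn’t matter which one the initiator happens to pick.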

The fact that the same data is on both machines and that AoE allows for a quick and painless transfer between vblade targets is what makes this such a simple and effective solution for me. There may be a few seconds of lag while the AoE initiator realizes that the machine it was talking to has disappeared, but that will pass as soon as it does another device discovery and sees the other vblade target. This is perfectly acceptable in my usage scenarios thus far.

I took the plunge a week or so ago and started converting my storage at home to use drbd. It was pretty simple to convert from my LVM-based setup, since all I had to do was create one additional small LVM partition for every partition I wanted to sync between machines. These additional LVM partitions store the metadata that drbd uses to track changes and to keep things in sync. This configuration also allows me to revert back to using “naked” LVM partitions as my vblade storage targets if I decide I don’t like drbd in the future. I used my MythTV recording backend as the second drbd server, since it has a lot of space for extra drives and is on pretty much all the time. I put in a 120GB drive and let everything fly. Once the initial synchronization was complete, I did a few tests, and everything worked as intended. I could kill vblade targets on either machine, and after a few seconds, the initiators would look at the other machine and use it for storage. Success!
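For reference, the per-volume conversion boiled down to something like this. The names are illustrative, and the exact drbdadm invocations differ between drbd versions:

```shell
# The data LV already exists; add a small LV next to it for drbd's
# external metadata, which is what lets the data LV stay untouched
# and revertible to a "naked" vblade target later.
lvcreate -L 128M -n xen-disk0-meta vg0

# Bring the resource up on both nodes, then, on the machine holding
# the good copy of the data, force it primary to start the initial sync.
drbdadm up r0
drbdadm primary r0   # newer drbd versions want an explicit
                     # overwrite-data-of-peer flag the first time
```

Keeping the metadata out-of-band like this is the whole trick: drbd never touches the layout of the data partition itself.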

As of this weekend I’ve also converted my setup at work to use the same general configuration. The primary storage consists of a big RAID array, with a secondary machine using a single drive as a backup. I figure that in most cases running in an active-active setup wouldn’t be necessary, so I’m going to stick with active-standby, and only start the vblade targets on the secondary machine when I’m planning on a reboot or other maintenance event. I’ve also considered running in an active-off state (with periodic resyncs), so that there wouldn’t be any performance hit from waiting for the second server to complete its writes. This would probably be a less desirable setup, though, since the data could (and very likely would) be out of date if I were to suffer an unplanned outage such as a hardware failure. Nothing I run currently is terribly needy in terms of disk write performance, so I’m not overly concerned at this point.
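The planned-maintenance handoff I have in mind would look roughly like this. Again, the names are placeholders, and this assumes the active-standby arrangement described above:

```shell
# On the primary, before taking it down: stop exporting and demote.
pkill vbladed
drbdadm secondary r0

# On the standby: promote it and start advertising the same AoE address.
drbdadm primary r0
vbladed 0 1 eth0 /dev/drbd0

# The initiators notice the old target has disappeared and pick up the
# new one after their next discovery pass -- the few seconds of lag
# mentioned earlier.
```

Running the steps in this order keeps the setup in plain active-standby the whole time, so only one node is ever primary.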
