Mike Neir's Page
Viewing 3 posts tagged with 'lvm'

Xen + AoE + drbd = New Redundant Hotness

Tuesday, July 22 2008, 2:17 AM

A few weeks ago, my buddy Justin posed an interesting problem, one that I've been pondering myself for some time. He's somewhat of a Xen zealot like myself, and is doing some Xen setups that are similar in construction to mine, with a central shared storage array and two or more dom0 machines where the child instances will live. The prospect of migrating domUs between dom0s is quite appealing to him, but he, like myself, realized a critical flaw in the setup. If the storage array fails or requires uptime-affecting maintenance of some sort, the whole setup grinds to a halt. That doesn't really fit the goals he and I are both after.

After a bit of thought, I looked to a project Justin had mentioned a few months back called drbd. Its designers deem it "network raid 1", and that's a pretty accurate description. It's essentially a system that mirrors data between two different machines in either an active-standby or an active-active configuration. One of its primary goals is to provide storage as close to 100% of the time as is possible. Its usefulness varies a lot depending on the application. Having a normal file system on it shared between two machines could be nothing short of a nightmare, since neither server knows what the other is doing until changes are already written to the shared storage. A clustered file system would work well with it though. As I began to learn more about how it works, I realized it could potentially be a great solution for my predicament. Since either of the two machines could provide storage at any given time, it would have no problem fulfilling the near 100% uptime requirement.

What really makes the solution stand out to me isn't just drbd itself, but the combination of drbd and AoE. AoE is, by design, a connectionless protocol. When the kernel module is loaded, it does a device discovery to see what devices are available for its use, and listens thereafter for new devices. The information it learns is pretty much limited to a MAC address where the storage device is located and the vblade "addresses" within that device that are available. There's nothing within the protocol that forbids multiple targets from advertising the same vblade "address", and it's up to the AoE initiator in the kernel module to choose where it's sending data. Because of this, you could have two linux vblade targets running on both "ends" of a drbd setup, and there'd be no conflicts whatsoever. The recommended setup in drbd is to consider a write operation as finished only when data has been written to disk on both "ends" of the drbd setup. Combine that with the fact that AoE will only send commands to one MAC address at a time, and it's pretty much guaranteed that both vblade targets will be connected to the same data at all times, even though they're on different machines. I can think of a scenario or two where data would be out of sync, but it would require that disk write operations be done in parallel, and I'm fairly certain that they aren't.
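
To make the "finished only when both ends have it" idea concrete, here's a toy Python sketch. It is not drbd code and doesn't touch any real API; the Disk and MirroredDevice classes are names I made up purely for illustration. All it shows is that a write gets acknowledged only after both copies have been stored, so either end holds the same data once the write returns.

```python
class Disk:
    """A toy block device: a dict mapping block number to bytes."""

    def __init__(self):
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data
        return True


class MirroredDevice:
    """Hypothetical model of a synchronous two-node mirror (not drbd itself)."""

    def __init__(self, local, peer):
        self.local = local
        self.peer = peer

    def write(self, block, data):
        # Report success only after BOTH ends have stored the block.
        ok_local = self.local.write(block, data)
        ok_peer = self.peer.write(block, data)
        return ok_local and ok_peer


node_a, node_b = Disk(), Disk()
mirror = MirroredDevice(node_a, node_b)
acked = mirror.write(7, b"domU data")
print(acked, node_a.blocks == node_b.blocks)
```

The real thing obviously has to deal with network failures and resyncs, which this sketch ignores completely.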

The fact that the same data is on both machines and that AoE allows for a quick and painless transfer between vblade targets is what makes this such a simple and effective solution for me. There may be a few seconds of lag while the AoE initiator realizes that the machine it was talking to has disappeared, but that will pass as soon as it does another device discovery and sees the other vblade target. This is perfectly acceptable in my usage scenarios thus far.
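
Here's a rough Python model of that failover behavior, as I understand it. Everything in it is an assumption made for illustration: the Target and Initiator classes, the MAC addresses, and the shelf/slot pair (0, 1) are all made up, and the real aoe kernel module handles timeouts and retransmits in ways this doesn't attempt to copy. The point is just that two targets can advertise the same address, the initiator uses one MAC at a time, and a fresh discovery finds the survivor.

```python
class Target:
    """Toy vblade target: one MAC address exporting some (shelf, slot) pairs."""

    def __init__(self, mac, exports):
        self.mac = mac
        self.exports = set(exports)
        self.alive = True


class Initiator:
    """Toy AoE initiator that talks to one MAC per exported address."""

    def __init__(self, targets):
        self.targets = targets
        self.routes = {}  # (shelf, slot) -> MAC currently in use

    def discover(self):
        """Rebuild the address table from targets that still answer."""
        self.routes = {}
        for target in self.targets:
            if target.alive:
                for addr in target.exports:
                    # First responder wins; only one MAC is used per address.
                    self.routes.setdefault(addr, target.mac)

    def send(self, addr):
        """Return the MAC that served the request, rediscovering if needed."""
        alive = {t.mac for t in self.targets if t.alive}
        if self.routes.get(addr) not in alive:
            self.discover()
        if addr not in self.routes:
            raise IOError("no target exports %r" % (addr,))
        return self.routes[addr]


primary = Target("00:00:00:00:00:0a", [(0, 1)])
secondary = Target("00:00:00:00:00:0b", [(0, 1)])
initiator = Initiator([primary, secondary])
initiator.discover()

first = initiator.send((0, 1))   # served by the primary
primary.alive = False            # storage box goes down
second = initiator.send((0, 1))  # rediscovery picks the secondary
print(first, second)
```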

I took the plunge a week or so ago and started converting my storage at home to use drbd. It was pretty simple to convert from my LVM-based setup, since all I had to do was create one additional LVM partition for every partition I wanted to sync between machines. These additional LVM partitions store the metadata that drbd uses to track changes and to keep things in sync. This configuration also allows me to revert back to using "naked" LVM partitions as my vblade storage targets if I decide I don't like drbd in the future. I used my MythTV recording backend as the second drbd server, since it has a lot of space for extra drives and is on pretty much all the time. I put in a 120GB drive, and let everything fly. Once the initial synchronization was complete, I did a few tests, and everything worked as intended. I could kill vblade targets on either machine, and after a few seconds, the initiators would look at the other machine and use it for storage. Success!

As of this weekend I've also converted my setup at work to use the same general configuration. The primary storage consists of a big RAID array, with a secondary machine using a single drive as a backup. I figure that in most cases running in an active-active setup wouldn't be necessary, so I'm going to stick with active-standby, and only start the vblade targets on the secondary machine when I'm planning on a reboot or other maintenance event. I've also considered running in an active-off state (with periodic resyncs), so that there wouldn't be any performance hit from waiting for the second server to complete its writes. This would probably be a less desirable setup since the data could (and very likely would) be out of date if I were to suffer an unplanned outage such as a hardware failure. Nothing I run currently is terribly needy in terms of disk write performance, so I'm not terribly concerned at this point.


More Fun With AoE

Tuesday, October 23 2007, 12:54 AM

In another instance of me finding a solution to an imaginary problem, I've succeeded in creating a diskless workstation that uses an AoE-exported LVM partition as its root filesystem. I've been wanting to try it out for a while, and this weekend became the time. The setup turned out to be pretty easy once I got past some of the technical hurdles. I didn't do a complete install over AoE this time around, but I've proven it's at least possible to make it go using an image made on another machine.

The setup I used is similar to what I've got going with my Xen + AoE setup on my colo boxes at work. I've got a LVM partition that houses the files, and a vblade server exporting that LVM partition over AoE. Out of laziness, I just copied the Fedora7 install from my linux workstation into the LVM partition to provide something I could try booting.

The client side of AoE is provided by a kernel module, so it's not hard to get things working in normal circumstances. Getting it to work on boot was a bit more of a chore, but was pretty easy once I got past the stumbling blocks. I used gPXE/Etherboot on a CD-ROM to get things into a network-bootable state since the onboard network boot stuff on the machine I used is crap and doesn't work. From there, I loaded pxegrub, which then grabbed the kernel and initrd image from my file server via TFTP. I've been booting my MythTV frontend in a similar manner for some time now, only using NFS as the root filesystem.

Getting the initrd right was the biggest hurdle in the whole endeavor. Using mkinitrd I was able to get the AoE and NIC drivers to load without issue. I knew I would have to modify the initrd to include the proper device nodes so that the init script could communicate with the AoE module and see the block devices used to tie into the export, but it wouldn't see the export no matter what I did. Some hours later and after the creation of many initrd files, it occurred to me that the network interface was never brought up into an active state. The only way I know to bring up an interface is with ifconfig, so I copied that into the initrd along with the shared libraries it needed. After modifying the init script to issue the command to bring up the interface, everything worked! The machine booted its Fedora7 install right over the network as if the install were on a local hard drive.
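
Figuring out which shared libraries a binary like ifconfig needs is the sort of thing ldd will tell you. As a minimal sketch (not what I actually ran), here's a bit of Python that picks the library paths out of ldd-style output so they can be copied into the initrd tree. The parse_ldd function is a hypothetical helper, and the SAMPLE text is made-up illustrative output, not captured from my machine.

```python
def parse_ldd(output):
    """Return the absolute library paths named in ldd-style output."""
    paths = []
    for line in output.splitlines():
        fields = line.split()
        if "=>" in fields:
            idx = fields.index("=>")
            # 'name => /path (0xaddr)'; skip entries with no resolved path.
            if idx + 1 < len(fields) and fields[idx + 1].startswith("/"):
                paths.append(fields[idx + 1])
        elif fields and fields[0].startswith("/"):
            # The dynamic loader is listed as a bare absolute path.
            paths.append(fields[0])
    return paths


# Illustrative sample only; real output differs per machine and distro.
SAMPLE = """\
\tlinux-gate.so.1 =>  (0x00110000)
\tlibc.so.6 => /lib/libc.so.6 (0x00a5d000)
\t/lib/ld-linux.so.2 (0x00a3e000)
"""

libs = parse_ldd(SAMPLE)
print(libs)
```

On a real system you'd feed it the output of running ldd against the ifconfig binary and copy each returned path into the matching directory inside the unpacked initrd.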

As far as usability goes, it's pretty snappy. If you're paying attention, you can see a slight delay in low-latency type things like tab completion at a bash prompt, but loading applications and reading/writing things in bulk is pretty close to on par with a regular local disk.

Now that I've done it, I want to go back and see if it's possible to do a whole install to an AoE export. If I can do the same kernel module tomfoolery before the install process as I did in the initrd, it should be possible to do the install to the AoE share instead of a local disk. Booting would still require PXE, TFTP, and the modified initrd image, but that stuff is easy now that I know how to get things done.

I plan on writing up a more detailed wiki article once I have a bit more motivation, but that time is not now. I didn't have much help from the intarweb in finding this solution, even though there were a few people mentioning that they'd like to try it. I already get a decent amount of traffic here for my Xen+AoE stuff, so maybe this AoE root stuff will be helpful to others too.


Xen + AoE = New Hotness

Thursday, June 14 2007, 8:08 AM

In my continued experimentation with hot migration of Xen environments, I think I've found a pretty awesome solution. It involves a system called ATA over Ethernet (AoE). This system transmits ATA commands over ethernet, so it allows for a remote disk to be treated like local block storage. The system was originally designed by a company called Coraid for use with their own proprietary disk arrays, but they produced a piece of software that replicates the same functionality on a normal linux machine.

I was doing experimentation with using NFS root filesystems, but there were a few things I didn't like about it. First off, creating the kernel was a pain. With all of the effort I mentioned in my previous post on the subject, keeping an updated kernel would be a total pain if you were using CentOS 5 like I am. Second, the kernel didn't seem to perform any caching of the NFS filesystems, so there was a large amount of traffic flowing over the network from all of the filesystem reads that the Xen environments were doing. Third, all of the root filesystem reads/writes were visible to the Xen instances, so their bandwidth counters (and their associated graphs in my Cacti system) were skewed by a large amount.

These issues don't seem to occur with AoE. The filesystems are imported on the host, so the stock CentOS Xen kernel doesn't have to be modified in any way. This also renders the network traffic required in maintaining the filesystems invisible to the Xen domains. The filesystem acts as a normal block device, so it is cached like a normal local disk is cached.
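
For the curious, here's a small Python sketch of how an AoE export shows up on the host and how it might be handed to a Xen domain. The aoe driver names its block devices /dev/etherd/e<shelf>.<slot>, and Xen disk entries use the phy: prefix for host block devices. The shelf and slot numbers, the xvda guest device name, and the helper function names are all assumptions for the example, not values from my setup.

```python
def aoe_device(shelf, slot):
    """Path the aoe driver gives an export with this shelf and slot."""
    return "/dev/etherd/e{}.{}".format(shelf, slot)


def xen_disk(shelf, slot, guest_dev="xvda", mode="w"):
    """Build a Xen disk entry that passes the AoE device to a domain."""
    return "phy:{},{},{}".format(aoe_device(shelf, slot), guest_dev, mode)


# Hypothetical export: shelf 0, slot 1.
disk = [xen_disk(0, 1)]
print(disk)
```

Since Xen domain config files are themselves Python syntax, the resulting list is the same shape as the disk line you'd put in the domain's config.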

That's not to say there weren't issues. At first, the vblade daemon (the linux 'server' component of the AoE system) seemed pretty unstable. It seemed to randomly lock up, causing all of my Xen domains to crash, and forcing a reboot of the host server. I think it was just the way I was using it though. I was running the vblade program and backgrounding it, instead of using the vbladed script that was provided. I think it was locking things up when I disconnected the terminal in which I started the vblade instances. When the controlling PTY died, it caused the vblade instances to die in a bad way due to a lack of standard input and output channels. The vbladed script controls all of the input and output paths, so there's no worry if the terminal disconnects. Since I've started using vbladed, about three weeks ago, I haven't had a single failure.
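
The general trick of cutting a long-running process loose from the terminal can be sketched in Python like this. To be clear, this is not how the vbladed script is written; it's just one way to get the same effect, with the start_detached helper being my own made-up name. The vblade arguments follow its usual order (shelf, slot, network interface, backing device), but the specific values are hypothetical.

```python
import subprocess
import sys


def start_detached(argv):
    """Start a process with no ties to the calling terminal."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,   # nothing to read from a dead PTY
        stdout=subprocess.DEVNULL,  # nowhere to block on output
        stderr=subprocess.DEVNULL,
        start_new_session=True,     # new session, so no controlling terminal
    )


# Hypothetical export: shelf 0, slot 1, on eth0, backed by an LVM partition.
VBLADE_ARGV = ["vblade", "0", "1", "eth0", "/dev/vg0/xen1"]

# Harmless stand-in so the sketch runs on a machine without vblade installed.
proc = start_detached([sys.executable, "-c", "pass"])
print(proc.wait())
```

Swapping the stand-in for VBLADE_ARGV would launch the real target, assuming vblade is installed and the device exists.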

I'm currently running vbladed against the LVM partitions I used with my NFS root filesystems. Off the bat, I thought this would come up a little short because I didn't have a swap partition available to the Xen domains. Then I remembered that I could use a regular flat file as swap space, so the problem went away.
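
In case the flat-file swap trick is new to anyone, here's a minimal Python sketch that just spells out the usual sequence of commands. The swapfile_commands helper, the /swapfile path, and the 512MB size are assumptions for illustration; the sketch only builds and prints the commands, it doesn't run them.

```python
def swapfile_commands(path, size_mb):
    """Return the commands that create and enable a swap file."""
    return [
        # Allocate the file by filling it with zeros.
        ["dd", "if=/dev/zero", "of=" + path, "bs=1M", "count=" + str(size_mb)],
        # Swap contents shouldn't be readable by other users.
        ["chmod", "600", path],
        # Write the swap signature, then turn it on.
        ["mkswap", path],
        ["swapon", path],
    ]


commands = swapfile_commands("/swapfile", 512)
for cmd in commands:
    print(" ".join(cmd))
```

These would be run as root inside the Xen domain, and an fstab entry would be needed to make the swap file stick across reboots.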

Since the vblade server allows you to export a whole block device, be it a whole disk, a single partition, a LVM partition, or a whole RAID array, it opens up some interesting possibilities. On the remote system, you can access the exported block device as if it were a disk, partitioning it as you see fit, while on the system exporting, it could be one of many LVM partitions. This allows for the possibility of creating a "mini hard drive" for each Xen instance, each with its own root filesystem, swap space, and whatever else is deemed necessary. I haven't implemented this because I want to be able to use my LVM partitions with NFS if stability becomes an issue, but it would be a pretty neat setup.
