Xen + AoE + drbd = New Redundant Hotness
A few weeks ago, my buddy Justin posed an interesting problem, one that I've been pondering myself for some time. He's somewhat of a Xen zealot like myself, and is doing some Xen setups that are similar in construction to mine, with a central shared storage array and two or more dom0 machines where the child instances will live. The prospect of migrating domUs between dom0s is quite appealing to him, but he, like myself, realized a critical flaw in the setup. If the storage array fails or requires uptime-affecting maintenance of some sort, the whole setup grinds to a halt. That doesn't really fit the goals he and I are both after.
After a bit of thought, I looked to a project Justin had mentioned a few months back called drbd. Its designers deem it "network raid 1", and that's a pretty accurate description. It's essentially a system that mirrors data between two different machines either in an active-standby or active-active configuration. One of its primary goals is to provide storage as close to 100% of the time as is possible. Its usefulness would vary highly depending on the application. Having a normal file system on it shared between two machines could be nothing short of a nightmare, since neither server knows what the other is doing until changes are already written to the shared storage. A clustered file system would work well with it though. As I began to learn more about how it works, I realized it could potentially be a great solution for my predicament. Since either of the two machines could provide storage at any given time, it would have no problem fulfilling the near 100% uptime requirement.
What really makes the solution stand out to me isn't just drbd itself, but the combination of drbd and AoE. AoE is, by design, a connectionless protocol. When the kernel module is loaded, it does a device discovery to see what devices are available for its use, and listens thereafter for new devices. The information it learns is pretty much limited to a MAC address where the storage device is located and the vblade "addresses" within that device that are available. There's nothing within the protocol that outlaws multiple targets from advertising the same vblade "address", and it's up to the AoE initiator in the kernel module to choose where it's sending data. Because of this, you could have two linux vblade targets running on both "ends" of a drbd setup, and there'd be no conflicts whatsoever. The recommended setup in drbd is to consider a write operation as finished only when data has been written to disk on both "ends" of the drbd setup. Combine that with the fact that AoE will only send commands to one MAC address at a time, and it's pretty much guaranteed that both vblade targets will be connected to the same data at all times, even though they're on different machines. I can think of a scenario or two where data would be out of sync, but it would require that disk write operations be done in parallel, and I'm fairly certain that they aren't.
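To make the "finished only when both ends have it" rule concrete, here is a toy Python model of a synchronous mirror. It is only a sketch of the idea, not drbd's implementation: the `Replica` and `MirroredDevice` classes and the host names are made up for illustration, and the model ignores what a real mirror does when its peer drops off the network.

```python
class Replica:
    """One 'end' of the mirrored pair; blocks are kept in a dict (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True

    def write(self, block_no, data):
        if not self.online:
            raise IOError(f"{self.name} is unreachable")
        self.blocks[block_no] = data


class MirroredDevice:
    """A write is reported as finished only after BOTH ends have stored it."""

    def __init__(self, local, peer):
        self.local = local
        self.peer = peer

    def write(self, block_no, data):
        self.local.write(block_no, data)
        self.peer.write(block_no, data)
        # Only reached when both writes succeeded.
        return True


storage_a = Replica("storage-a")  # hypothetical host names
storage_b = Replica("storage-b")
mirror = MirroredDevice(storage_a, storage_b)
mirror.write(0, b"root filesystem block")
```

Because the acknowledgement comes after both writes, whichever end a vblade target happens to be serving from holds every block the initiator was told is safely written.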
The fact that the same data is on both machines and that AoE allows for a quick and painless transfer between vblade targets is what makes this such a simple and effective solution for me. There may be a few seconds of lag while the AoE initiator realizes that the machine it was talking to has disappeared, but that will pass as soon as it does another device discovery and sees the other vblade target. This is perfectly acceptable in my usage scenarios thus far.
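The failover behaviour can be sketched the same way. The following Python model is an assumption-laden illustration, not the kernel module's logic: `Target`, `Initiator`, the MAC addresses, and the shelf/slot numbers are all hypothetical. It only shows the two properties described above: commands go to one MAC at a time, and a fresh discovery picks up the surviving target.

```python
class Target:
    """A vblade-style target advertising one shelf.slot address (hypothetical model)."""

    def __init__(self, mac, shelf, slot):
        self.mac = mac
        self.shelf = shelf
        self.slot = slot
        self.alive = True


def discover(targets, shelf, slot):
    """Return the MACs of live targets that advertise the given shelf.slot."""
    return [t.mac for t in targets
            if t.alive and (t.shelf, t.slot) == (shelf, slot)]


class Initiator:
    """Sends each command to exactly one MAC; rediscovers if that MAC vanishes."""

    def __init__(self, targets, shelf, slot):
        self.targets = targets
        self.shelf = shelf
        self.slot = slot
        self.current_mac = None

    def send(self, command):
        live = discover(self.targets, self.shelf, self.slot)
        if self.current_mac not in live:
            if not live:
                raise IOError("no target is advertising this address")
            self.current_mac = live[0]
        return (self.current_mac, command)


primary = Target("00:00:00:00:00:01", shelf=0, slot=1)
secondary = Target("00:00:00:00:00:02", shelf=0, slot=1)
initiator = Initiator([primary, secondary], shelf=0, slot=1)
```

Marking `primary.alive = False` and calling `initiator.send(...)` again moves the traffic to the secondary's MAC, which is the "few seconds of lag, then carry on" behaviour in miniature.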
I took the plunge a week or so ago and started converting my storage at home to use drbd. It was pretty simple to convert from my LVM-based setup, since all I had to do was create one additional LVM partition for every partition I wanted to sync between machines. These additional LVM partitions store the metadata that drbd uses to track changes and to keep things in sync. This configuration also allows me to revert back to using "naked" LVM partitions as my vblade storage targets if I decide I don't like drbd in the future. I used my MythTV recording backend as the second drbd server, since it has a lot of space for extra drives and is on pretty much all the time. I put in a 120GB drive, and let everything fly. Once the initial synchronization was complete, I did a few tests, and everything worked as intended. I could kill vblade targets on either machine, and after a few seconds, the initiators would look at the other machine and use it for storage. Success!
As of this weekend I've also converted my setup at work to use the same general configuration. The primary storage consists of a big RAID array, with a secondary machine using a single drive as a backup. I figure that in most cases running in an active-active setup wouldn't be necessary, so I'm going to stick with active-standby, and only start the vblade targets on the secondary machine when I'm planning on a reboot or other maintenance event. I've also considered running in an active-off state (with periodic resyncs), so that there wouldn't be any performance hit from waiting for the second server to complete its writes. This would probably be a less desirable setup since the data could (and very likely would) be out of date if I were to suffer an unplanned outage such as a hardware failure. Nothing I run currently is terribly needy in terms of disk write performance, so I'm not terribly concerned at this point.
Nice One, Stupid
Even though they're an over-priced pseudo-monopoly with a track record of shitty customer service and only somewhat better service uptime, I owe Comcast an apology. My internet was down for most of the past weekend, and for most of Monday as well. I figured it was due to the storms that rolled through on the night of the outage (Saturday), but after a couple of days with no service, I started getting mad. I was cursing their name and anything related to them. I was particularly unhappy when I found out that a coworker who lives in my apartment complex had no interruptions in service. After I heard that, I started thinking of ways that my setup could be sabotaging the process.
And then it hit me. As part of my process to convert my firewall machine into a Xen instance, I altered the physical networking layout so my cable modem would plug directly into my "Core Switch", an old Cisco 2924XL. I gave the cable modem service its own VLAN, which would be accessible via my firewall instance running on a Xen machine. What I failed to consider is that managed switches tend to have features that allow for communication with other switches in order to facilitate ease of management and network health. This communication is typically broadcast at regular intervals to any device that is listening.
These broadcasts are what caused my issue. In a normal residential cable modem service (with Comcast at least), the cable modem latches on to the first network device it hears traffic from, and assumes that it will be the one it deals with when connecting to the internet. By having my cable modem plugged directly into the switch, it was receiving the switch's broadcast messages before my firewall instance had a chance to make itself heard. Because of this, my firewall's attempts to connect to the internet fell on deaf electronic ears.
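The ordering problem is easy to show in a few lines of Python. This is a toy model of the behaviour described above, not anything taken from a real modem's firmware; the `CableModem` class and the MAC addresses are invented for the example.

```python
class CableModem:
    """Toy model: the modem locks onto the first source MAC it hears."""

    def __init__(self):
        self.locked_mac = None

    def hear(self, src_mac):
        """Return True if traffic from src_mac would be serviced."""
        if self.locked_mac is None:
            self.locked_mac = src_mac
        return src_mac == self.locked_mac


SWITCH_MAC = "00:00:00:00:00:aa"    # hypothetical: the switch's management traffic
FIREWALL_MAC = "00:00:00:00:00:bb"  # hypothetical: the firewall instance

modem = CableModem()
modem.hear(SWITCH_MAC)                   # switch chatter arrives first
firewall_ok = modem.hear(FIREWALL_MAC)   # False: the modem is already spoken for
```

Silence the switch on that port and the firewall becomes the first device the modem hears, which is all the fix amounts to.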
This was remedied easily enough by disabling spanning tree protocol on the VLAN that my cable modem connects to, and disabling Cisco Discovery Protocol broadcasts on the port it connects to. I don't like disabling spanning tree, because quite frankly, network loops suck. The chance that I'd somehow make a loop in that VLAN is pretty damn low though, so there's not much to worry about.
Let this be a lesson to those with way too much time on their hands, like myself.
Longest Post About Xen Ever.
WARNING: Gratuitous technobabble ahead. Moms and non-linux types may want to just go on with their day and skip this one.
In my last post I alluded to doing some new fun stuff with AoE and Xen. One of the things that I like so much about Xen is its ability to combine the roles typically performed by multiple machines down into one machine. A few things that require specialized hardware, such as my MythTV recording backend machine, are not really feasible to put into virtualized environments, but there are quite a few tasks that work very well. One of my goals with my setups is to move everything I can onto a single "cluster," with a single machine providing shared storage, and two (or more) machines running Xen environments. That way, I can add new instances with distinct functionality without adding (much) hardware.
One of the roles that has been resistant to my Xen transition is my firewall machine. My home network is divided into three separate logical networks, each with a different purpose. At one point, the firewall machine had four ethernet cards in it - one per network, plus one that connected to the internet. I was able to cut that down to two and eventually one physical interface by acquiring some VLAN-aware network gear. By using VLANs at the network level, I was able to push all four networks through one physical link. Having reached that goal, I knew it was feasible to run my firewall machine in a Xen instance. Actually, I knew it was possible since I'd read about other people doing it, but those stories came with very little in the way of implementation detail. I figured that if I could get the same VLAN information into a Xen instance, my goal would be attainable. In my ideal setup, I would be able to pass VLAN encapsulated traffic to an instance, traffic not encapsulated in a VLAN, or both.
Getting VLANs to work properly inside of a Xen instance was not as easy as I would have liked, however. I gave it a cursory attempt a few weeks ago, and it failed pretty badly, so I moved on to something prettier and shinier. I didn't really dig much into why it failed, but it turns out that it was pretty simple. In my initial attempt, I tried to run the standard Xen network-bridge script against a VLAN interface defined by the 8021q kernel module. When I attempted to create the Xen network bridge on this interface, everything seemed to work, but in the end, only the bridge itself was present. The VLAN interface and its associated virtual interface were nowhere to be found. I tweaked a few things, with no change in the end result. I must not have been feeling particularly curious that day, and I let the attempts end in failure.
I decided to give things another shot last weekend. I searched around and found a few other people making the same attempts I was. One guy even published a few scripts that allowed him to connect VLAN interfaces to his network bridges so that each instance could have its own VLAN. While somewhat similar to my goal of passing both encapsulated and non-encapsulated traffic into my Xen instances, it wasn't exactly the same. I looked over his scripts in the attempt to figure out what they were doing, and then looked at the default scripts that came with Xen. After an hour or two of digging through code, I realized why my previous attempt failed. When the network-bridge script does its thing, it takes an interface, puts it into an inactive state, renames it, gives its old name to one of the virtual ethernet interfaces provided by the netloop kernel module, and then attaches both of the aforementioned interfaces to a newly-created bridge interface. The deactivation of the physical interface is where things went awry. The script makes a call to the 'ifdown' script, which deactivates the interface. On a normal physical interface, there isn't much to do short of just downing the interface, but with a VLAN interface, it actually destroys the interface in the process of disabling it. I had found my linchpin... so I thought.
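The sequence the script walks through can be written out as a list of commands. The Python below is a rough sketch and not the network-bridge script itself: the interface names (`peth0`, `veth0`, `vif0.0`, `xenbr0`) follow the naming recalled from Xen 3.x and should be treated as assumptions, as should attaching the `vif0.0` backend to the bridge. The `skip_ifdown` flag is hypothetical; it swaps the `ifdown` call for a plain link-down, which leaves a VLAN interface in place.

```python
def bridge_commands(netdev="eth0", bridge="xenbr0",
                    veth="veth0", vif="vif0.0", skip_ifdown=False):
    """Return the shell commands for the bridge setup, in order (illustrative only)."""
    pdev = "p" + netdev  # the renamed physical (or VLAN) interface
    commands = []
    if skip_ifdown:
        # Assumption: taking the link down directly preserves a VLAN interface,
        # whereas ifdown tears it down completely.
        commands.append(f"ip link set {netdev} down")
    else:
        commands.append(f"ifdown {netdev}")
    commands += [
        f"ip link set {netdev} name {pdev}",  # physical interface gets a new name
        f"ip link set {veth} name {netdev}",  # virtual interface takes the old name
        f"brctl addbr {bridge}",              # create the bridge
        f"brctl addif {bridge} {pdev}",       # attach the renamed interface
        f"brctl addif {bridge} {vif}",        # attach the virtual interface's backend
        f"ip link set {pdev} up",
        f"ip link set {bridge} up",
    ]
    return commands


for line in bridge_commands("eth0.10", bridge="xenbr10", skip_ifdown=True):
    print(line)
```

Printing the commands instead of running them keeps the sketch harmless; the point is the order of operations and where the `ifdown` call sits in it.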
I whipped up a quick patch to alter the behavior of the network-bridge script so it wouldn't make the call to ifdown, which preserved the VLAN interface. I did a few tests, and things worked as I felt they should. I set up all of the VLAN interfaces in the domain0 environment, configured the network-bridge script to create a Xen bridge for each of them, and then let it rip. All of the interfaces that needed to be renamed were renamed, all the proper bridges were created, and everything was connected where it should have been connected. Everything looked great - except it completely didn't work. After I changed the network port on my test machine to trunk all VLANs instead of just utilizing my main VLAN, everything stopped working, even though I had the right network stuff configured on the VLAN interfaces in the domain0 environment. I did some simple ping/tcpdump tests, and everything looked right, except the ping traffic was never making it into the dom0 like it should have. Outbound pings were visible in the proper VLAN interface, they egressed through the proper physical interface, made it to the destination machine, came back in through the proper physical interface... and then disappeared. The packets never made it back to the "physical" VLAN interface in the dom0, which prevented any kind of success.
After a good amount of expletives and some pacing around the apartment, I had an idea. My broken setup was connecting two separate types of virtual interfaces to the physical ethernet device in the whole initialization process - the VLAN interfaces and the bridge for trunked VLAN traffic. I decided to see if my missing packets were entering the bridge instead of the VLAN interface, and sure enough, they were. I had found another linchpin.
Seeing as how the traffic flow seemed to favor the bridge over the VLAN interfaces, I thought I was stuck. If the traffic is going into the bridge, it's game over for the non-VLAN-encapsulated traffic, right? Wrong. The whole purpose of the network-bridge script is to masquerade the physical interface, replace it with a virtual one that is only visible to the dom0 environment, and connect them both to a bridge that is used to pass data in and out of the dom0 and domU environments. The virtual interface acts as a normal interface in the dom0, so I figured it would be worth a try to configure the VLAN interfaces on the virtual interface instead of the physical one, and then build their bridges from that. I was a bit pessimistic at this point and really didn't expect it to work, but it sure did. Traffic started flowing properly to the dom0 environment, which meant the domUs would probably work too. I brought up a test environment, and all of the networking stuff worked as intended. It could see both encapsulated and non-encapsulated VLAN traffic.
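For anyone wondering what "encapsulated" versus "non-encapsulated" looks like on the wire, the difference is a four-byte 802.1Q tag after the source MAC. The Python sketch below classifies a raw Ethernet frame; the `parse_vlan` helper and the sample frames are made up for illustration, and it handles only a single tag (no stacked tags).

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q-tagged frame


def parse_vlan(frame):
    """Return (vlan_id, ethertype); vlan_id is None for an untagged frame."""
    if len(frame) < 14:
        raise ValueError("frame too short for an Ethernet header")
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != TPID_8021Q:
        return None, ethertype
    if len(frame) < 18:
        raise ValueError("frame too short for an 802.1Q tag")
    # The tag holds priority (3 bits), DEI (1 bit) and VLAN ID (12 bits),
    # followed by the real EtherType of the payload.
    tci, inner_ethertype = struct.unpack("!HH", frame[14:18])
    return tci & 0x0FFF, inner_ethertype


macs = bytes(12)  # placeholder destination + source MACs
untagged = macs + struct.pack("!H", 0x0800)
tagged = macs + struct.pack("!HHH", TPID_8021Q, 10, 0x0800)

print(parse_vlan(untagged))
print(parse_vlan(tagged))
```

A trunked port carries both kinds of frame at once, which is why the bridge and the VLAN interfaces ended up competing for the same traffic.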
It was Miller Time at that point, and pretty late to boot. I had my celebratory beer and hit the sack soon after. I waited until this past weekend to attempt my goal of moving my firewall setup into a Xen instance. I've become pretty good at moving Linux environments between physical machines and virtual ones, so the actual move was pretty painless. Once everything was ready, I stopped the networking stuff on my physical firewall machine and brought it up in the virtual one. The only issue to speak of was my cable modem locking in on the MAC address of the ethernet card in my physical firewall machine. I changed the virtual MAC address inside of the firewall Xen instance to match the physical firewall interface, and everything fell into place. I now have a firewall instance that I can migrate between my Xen host machines with no interruption. Great success!
I know I promised diagrams, but I'm not going to put them here. This post is already long enough as it is. I plan on making a wiki entry for the actual technical details (not just the technical description) of the process, so stay tuned for that. My diagrams will be in there.
Update From the Team of One
It seems that posting here regularly is getting more and more problematic. It seems as though I'm busy a lot of the time, but most of the stuff I'm busy with is mundane and would be boring to talk about, or it's stuff that shouldn't be talked about in a public forum. I've been a lot busier with human-oriented things at work as opposed to the purely technical side of things. It's definitely taking a while to get accustomed to. I still lack team members for my group, but that's my fault as much as anybody else's. I still need to come up with a list of qualifications that are required/desired for people interested in joining the team. It seems easy enough to do, but every time I think of it, I come up with more and more items to put on that list. It may be a never-ending task... :)
Other than work, there aren't a whole lot of things going on. I've been doing some more playing with AoE/Xen at home, and have succeeded in getting some things working that should open some doors regarding what I can do with those tools, but that's a subject for another post... One with drawings and diagrams. :)
Oh yah... It's getting warm! Yay for spring!
Complete Overkill
This past Saturday I spent the whole day at work. Over twelve hours. The day was boring for the most part, tedious, and I even fell asleep once or twice at my desk. Not good you say? The thing is, I wasn't working. I was migrating data off my personal RAID array so that I could expand it to encompass the second RAID cabinet that I just finished acquiring disks for.
This whole saga goes back to the beginning of the summer. I was in the middle of a pretty large eBay binge, all the while buying tons of computer stuff I didn't really need. While I certainly didn't need the stuff, having it around has allowed me to further my knowledge in things like Xen and play with other cool technologies like ATA over Ethernet (AoE). Anyway, during my browsing, I came across a RAID cabinet completely loaded out with 18GB drives pretty much identical to one I had purchased earlier on. The price was pretty cheap, so I nabbed it.
The customary week passed while UPS ground transported the 100 pound-plus package from its origins on the west coast to its new home in Lansing. When it arrived, I eagerly opened up the box and inventoried its contents, only to find that I wasn't given all that I was promised. The RAID cabinet itself, the full complement of twelve 18GB drives, and the mounting rails were all there, but the two external SCSI cables were missing. I promptly got in touch with the seller, and he assured me that it was just an oversight and that he would send the cables right away. Right away turned into a couple of weeks. I finally received the delivery, only to find that he had sent the wrong cables. They would easily fit into the ports on the back of the RAID card in my server, but the other end had the wrong connector type, and wouldn't connect to the RAID enclosure. I got back in touch with the seller again, and he again promised to right the wrong, and at no charge. Well, those cables never arrived. I was ticked, but I got over it.
Over the next few months I replaced the 18GB disks with 36GB disks so that I could merge the second RAID cabinet into my current RAID array, which is comprised of 36GB disks. Used 36GB drives are relatively inexpensive on eBay so I was able to get the requisite number of disks without too much expense. I also ordered the cables that I was denied in my original purchase.
All the pieces were in place last week, so I decided to try getting everything set up. I got everything put together and connected this past Thursday night, and started the process of finalizing things. The RAID card I'm using in the machine (a Compaq/HP Smart Array 5304-128) supports live expansion of an array onto new disks, so I fired up that process from the command line array management utility. I was mildly surprised that I could only extend onto seven of the twelve disks in the second RAID cabinet. I figured that it was just something in the way the expansion algorithm worked, and that I'd be able to expand onto the other disks when the initial expansion finished.
Rearranging 500GB of data on disks is not a quick process, so I had to wait a good twelve hours before I would discover that I was not able to expand the array onto those remaining five disks. I searched around for possible reasons why I was having difficulty and found nothing concrete. I did come across a blurb somewhere that mentioned RAID6 allowing for more disks than RAID5 in an array, which didn't really seem correct to me, but I decided to try converting the array to RAID6 anyway. Another twelve hours later, I discovered that didn't work either.
I decided to completely redo the RAID array from scratch, which leads us to last Saturday. I was up pretty early, so I made myself a big breakfast and headed into work to do the deed. I got started copying the data off the RAID array, which was no quick process. It takes a while to copy over a hundred gigabytes of data. During this time, I played some games on the laptop, walked around the datacenter, chatted with people, and took a few naps. There's a surprisingly comfortable position I've discovered on my desk where I can just zonk right out... when I'm not working of course! Anyway, once the data was copied off, I recreated the RAID array and was glad to see that I could use all the disks to their full capacity.
I began the process of copying data back onto the array, and settled in for another nap. I awoke a short time later to find that one of the drives in the array had croaked. It turned out to be one of the disks in the newest group I had purchased, and not really tested all that thoroughly. A disk failure sounds bad, but in RAID6, up to two disks in the array can fail without any data loss. So, in this case, it was just an annoyance. The RAID card detected the failure and immediately started working to rebuild the failed disk's data on one of the hot spares I have configured. This process slowed down the data transfer pretty substantially, so I was there for a lot longer than I wanted to be. I replaced the failed disk with one of my leftover disks, and the array was happy again.
So, now I have a nice RAID6 array composed of 26 36GB SCSI drives and two hot spares. It weighs in at around 860GB total, and takes up 11U worth of space on the racks at work. I find it pretty funny that there are now single SATA drives that can store more data than that entire array, but they are nowhere near as cool. I have the blinky light factor on my side, as evidenced here...
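The 860GB figure is just parity arithmetic: RAID6 gives up two disks' worth of space to parity, RAID5 gives up one, and hot spares don't count toward capacity at all. A quick Python sketch (the `usable_capacity_gb` helper is made up for this example, and it ignores controller overhead and the decimal-versus-binary gigabyte difference):

```python
PARITY_DISKS = {"raid5": 1, "raid6": 2}


def usable_capacity_gb(level, disks, disk_gb):
    """Usable space for an array of equal-sized disks, excluding hot spares."""
    parity = PARITY_DISKS[level]
    if disks <= parity:
        raise ValueError("not enough disks for this RAID level")
    return (disks - parity) * disk_gb


# 26 data+parity disks of 36GB each in RAID6; the two hot spares are left out.
print(usable_capacity_gb("raid6", 26, 36))  # 864
```

That comes out to 864GB, which matches the "around 860GB" the array reports.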
![[RAID array as of 2007-10-27]](/siteimages/RAIDsetup-20071027.jpg)
More Fun With AoE
In another instance of me finding a solution to an imaginary problem, I've succeeded in creating a diskless workstation that uses an AoE-exported LVM partition as its root filesystem. I've been wanting to try it out for a while, and this weekend became the time. The setup turned out to be pretty easy once I got past some of the technical hurdles. I didn't do a complete install over AoE this time around, but I've proven it's at least possible to make it go using an image made on another machine.
The setup I used is similar to what I've got going with my Xen + AoE setup on my colo boxes at work. I've got a LVM partition that houses the files, and a vblade server exporting that LVM partition over AoE. Out of laziness, I just copied the Fedora7 install from my linux workstation into the LVM partition to provide something I could try booting.
The client side of AoE is provided by a kernel module, so it's not hard to get things working in normal circumstances. Getting it to work on boot was a bit more of a chore, but was pretty easy once I got past the stumbling blocks. I used gPXE/Etherboot on a CD-ROM to get things into a network-bootable state since the onboard network boot stuff on the machine I used is crap and doesn't work. From there, I loaded pxegrub, which then grabbed the kernel and initrd image from my file server via TFTP. I've been booting my MythTV frontend in a similar manner for some time now, only using NFS as the root filesystem.
Getting the initrd right was the biggest hurdle in the whole endeavor. Using mkinitrd I was able to get the AoE and NIC drivers to load without issue. I knew I would have to modify the initrd to include the proper device nodes so that the init script could communicate with the AoE module and see the block devices used to tie into the export, but it wouldn't see the export no matter what I did. Some hours later and after the creation of many initrd files, it occurred to me that the network interface was never brought up into an active state. The only way I know to bring up an interface is with ifconfig, so I copied that into the initrd along with the shared libraries it needed. After modifying the init script to issue the command to bring up the interface, everything worked! The machine booted its Fedora7 install right over the network as if the install were on a local hard drive.
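The lesson generalizes: an AoE export can only appear after the interface is up, so anything that mounts it has to check that the device is there first. The real initrd does this in its own init script, not Python, so the snippet below only illustrates the "wait until the export appears, then mount" logic. The `wait_for_device` helper is hypothetical, and the shelf 0 / slot 0 path is an assumed example of the `/dev/etherd/e<shelf>.<slot>` naming the Linux aoe driver uses.

```python
import os
import time


def wait_for_device(path, timeout=30.0, interval=0.5):
    """Poll until path exists; return True if it showed up before the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Assumed example export: shelf 0, slot 0.
    if wait_for_device("/dev/etherd/e0.0", timeout=10.0):
        print("export is visible, safe to mount the root filesystem")
    else:
        print("no export found - is the network interface actually up?")
```

Had the init script run a check like this, the missing link-up step would have announced itself instead of taking hours to find.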
As far as usability goes, it's pretty snappy. If you're paying attention, you can see a slight delay in low-latency type things like tab completion at a bash prompt, but loading applications and reading/writing things in bulk is pretty close to on par with a regular local disk.
Now that I've done it, I want to go back and see if it's possible to do a whole install to an AoE export. If I can do the same kernel module tomfoolery before the install process as I did in the initrd, it should be possible to do the install to the AoE share instead of a local disk. Booting would still require PXE, TFTP, and the modified initrd image, but that stuff is easy now that I know how to get things done.
I plan on writing up a more detailed wiki article once I have a bit more motivation, but that time is not now. I didn't have much help from the intarweb in finding this solution, even though there were a few people mentioning that they'd like to try it. I already get a decent amount of traffic here for my Xen+AoE stuff, so maybe this AoE root stuff will be helpful to others too.
Weekend Update
First things first. The Star Wars Family Guy episode is absolutely freakin hilarious. It is, hands down, the best episode of Family Guy ever. EVER. Moving on...
This weekend has been an exercise in patience. It was the first weekend in a while where I didn't have some preexisting commitment, so I decided I'd try to get a few projects knocked out. I've been meaning to upgrade my Xen server here at home and make use of some of the half-height SCSI drives I mis-ordered a few months back. It turned into a much larger headache than I would have expected - mainly due to my stubbornness. I started the process of setting up the new environment in one of the machines Justin so graciously donated to the Mike Neir Hardware Fund a few months back. It's a nice server piece with a lot of space for drives, so it worked out well with the four SCSI drives I set up in RAID5. The problem lay in the fact that there were only four power leads coming out of the power supply. That meant no CD-ROM and no easy install. Sure, I could have put another power supply in the machine, but I was lazy yet determined to impose my will upon it, so I decided to make do without replacing the power supply. The machine gave me a good fight though. I ended up using cluster knoppix in terminal server mode to PXE boot the problem machine so I could copy the OS from my current Xen server to said problem machine. That worked, but I had a lot of trouble with the custom Xen kernel I had on the old app server. So, in the interests of laziness and getting everything working, I decided to upgrade the whole thing from CentOS 4.4 to CentOS 5, which has a pretty and non-difficult Xen kernel right out of the box. Twelve hours after I started, I prevailed. I probably could have had everything working in less than an hour if I weren't lazy and/or stubborn, but doing things the hard way always seems to yield more knowledge than the easy way. I'm a fan of learning and honing my skills, so it's all good.
After my difficulties with the Xen server, I performed the long-overdue task of cleaning out my fridge. I really need to get better at the task of managing left-overs methinks. There were some items in there that probably dated back to the first Clinton administration. After my cleaning task was complete, I put together a shopping list in the attempt to refill the fridge. I hadn't shopped for anything more than a gallon of milk or Mountain Dew in close to two months. I made a strong focus on acquiring a lot of raw materials instead of pre-made items since I've been experimenting a lot more with actually preparing food instead of just warming it up. I picked up just about every spice I've ever heard of, and the receipt reflected it. $276. Yeah. Two hundred and seventy-six dollars. Now I need to put the recipe book I have into action and make some tasty food.
Now I'm unhappy and less motivated to post. The Star Wars episode of Family Guy displaced Metalocalypse, which isn't a good thing. Grumblefest.
Xen + AoE = New Hotness
In my continued experimentation with hot migration of Xen environments, I think I've found a pretty awesome solution. It involves a system called ATA over Ethernet (AoE). This system transmits ATA commands over ethernet, so it allows for a remote disk to be treated like local block storage. The system was originally designed by a company called Coraid for use with their own proprietary disk arrays, but they produced a piece of software that replicates the same functionality on a normal linux machine.
I was doing experimentation with using NFS root filesystems, but there were a few things I didn't like about it. First off, creating the kernel was a pain. With all of the effort I mentioned in my previous post on the subject, keeping an updated kernel would be a total pain if you were using CentOS 5 like I am. Second, the kernel didn't seem to perform any caching of the NFS filesystems, so there was a large amount of traffic flowing over the network from all of the filesystem reads that the Xen environments were doing. Third, all of the root filesystem reads/writes were visible to the Xen instances, so their bandwidth counters (and their associated graphs in my Cacti system) were skewed by a large amount.
These issues don't seem to occur with AoE. The filesystems are imported on the host, so the stock CentOS Xen kernel doesn't have to be modified in any way. This also renders the network traffic required in maintaining the filesystems invisible to the Xen domains. The filesystem acts as a normal block device, so it is cached like a normal local disk is cached.
That's not to say there weren't issues. At first, the vblade daemon (the linux 'server' component of the AoE system) seemed pretty unstable. It seemed to randomly lock up, causing all of my Xen domains to crash, and forcing a reboot of the host server. I think it was just the way I was using it though. I was running the vblade program and backgrounding it, instead of using the vbladed script that was provided. I think it was locking things up when I disconnected the terminal in which I started the vblade instances. When the controlling PTY died, it caused the vblade instances to die in a bad way due to a lack of standard input and output channels. The vbladed script controls all of the input and output paths, so there's no worry if the terminal disconnects. Since I've started using vbladed, about three weeks ago, I haven't had a single failure.
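The general fix for this class of problem is to start the process with its input and output pointed somewhere that outlives the terminal. The Python below is a sketch of that idea and not a copy of what the vbladed script does: `spawn_detached` and `vblade_argv` are hypothetical helpers, and the shelf, slot, interface, and device values in the usage note are placeholders. The argument order follows vblade's `shelf slot netif filename` usage.

```python
import subprocess


def vblade_argv(shelf, slot, netif, device):
    """Build the argument list for exporting one block device with vblade."""
    return ["vblade", str(shelf), str(slot), netif, device]


def spawn_detached(argv, log_path):
    """Start argv with no ties to the calling terminal.

    stdin comes from /dev/null, stdout and stderr go to a log file, and the
    child gets its own session so a dying PTY cannot take it down.
    """
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )
    finally:
        # The child keeps its own copy of the descriptor.
        log.close()
```

Something like `spawn_detached(vblade_argv(0, 1, "eth0", "/dev/vg0/xen-root"), "/var/log/vblade-0.1.log")` would then survive a closed SSH session, assuming vblade is installed and the paths exist.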
I'm currently running vbladed against the LVM partitions I used with my NFS root filesystems. Off the bat, I thought this would come up a little short because I didn't have a swap partition available to the Xen domains. Then I remembered that I could use a regular flat file as swap space, so the problem went away.
Since the vblade server allows you to export a whole block device, be it a whole disk, a single partition, a LVM partition, or a whole RAID array, it opens up some interesting possibilities. On the remote system, you can access the exported block device as if it were a disk, partitioning it as you see fit, while on the system exporting, it could be one of many LVM partitions. This allows for the possibility of creating a "mini hard drive" for each Xen instance, each with its own root filesystem, swap space, and whatever else is deemed necessary. I haven't implemented this because I want to be able to use my LVM partitions with NFS if stability becomes an issue, but it would be a pretty neat setup.
Hot Migration Action
As described in my last few posts, I've recently acquired a good amount of new server hardware. Well, everything is in my possession now except a few sticks of RAM, and it's all set up at work. I ended up picking up the RAID enclosure I mentioned earlier, along with disks to fill it. It ended up being quite a bargain, with the enclosure, drive trays, and an external SCSI cable only costing around $50 plus shipping. Here's all the new gear mounted in a rack at work... My stuff is the white stuff in a sea of black servers.
I've got the RAID enclosure connected to the dual P3 1.0GHz machine I bought (furthest away on the bottom), and combined, there's 18 drive bays available to the SCSI system. I've got fifteen 36GB drives (plus one hot spare) in a RAID5 storage array and two 18GB drives in RAID1 for the OS installation. The RAID5 array weighs in at about 500GB, so I have plenty of room to keep stuff that I don't want to lose.
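The "about 500GB" figure follows from how RAID5 spends one drive's worth of space on parity. Here's a quick sketch of that arithmetic; the function name is just something I made up for illustration, and the sizes are the nominal drive sizes, so the formatted capacity will come out a little lower.

```python
def raid5_usable_gb(drive_count, drive_size_gb):
    """Nominal usable space of a RAID5 array.

    RAID5 stores one drive's worth of parity spread across the array,
    so usable space is (drive_count - 1) * drive_size_gb.
    Hot spares sit idle and are not counted in drive_count.
    """
    if drive_count < 3:
        raise ValueError("RAID5 needs at least three drives")
    return (drive_count - 1) * drive_size_gb


# Fifteen 36GB drives in the array; the hot spare is left out.
data_array_gb = raid5_usable_gb(15, 36)

# Bay count: 15 array drives + 1 hot spare + 2 RAID1 OS drives.
bays_used = 15 + 1 + 2

print(data_array_gb, "GB usable,", bays_used, "bays used")
```

That works out to 504GB nominal across all 18 bays.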
I'm currently seeing how well my Xen domains function with NFS root filesystems. So far it looks pretty good. I've got the domains that host my web site (among other things) and the mysql domain running off the RAID5 array via NFS, and I haven't noticed any slowdowns whatsoever. The only unexpected thing I've come across is a few weird incompatibilities with the Gentoo init scripts, specifically when it tries to bring up networking devices. It just hangs up when trying to initialize eth1, which is the interface that the NFS root filesystem is accessed through. My firewall script also kills things, but I should be able to fix that.
Having things running over NFS allows for live migration of running domains. I tried it out a few hours ago, and it's surprisingly painless, given that the appropriate functionality is enabled in the Xen daemon. One command sends a running domain between physical Xen hosts, which is pretty damned neat. I can see this being tremendously useful in a high-availability sort of environment. If a host machine needs maintenance, you can simply transfer the running child domain to another host, do your business, and transfer it back with only a fraction of a second of downtime.
Xen+NFS Root Filesystem Madness
As part of my continuing experimentation with Xen, I decided a while back to try running the child environments (domUs) from an NFS root filesystem, so I could play with hot-migrating domUs between Xen hosts. I just started playing with it a couple nights ago, and what a pain in the ass it's been.
First off, I'm using CentOS 5 as the Xen host operating system (dom0) because it's got Xen support built right in. Handy, right? Sure. It does not, however, have support for NFS root filesystems built into the Xen kernels it supplies. Not a big deal - I compile my own kernels all the time. I added the proper options into the kernel - IP Autoconfiguration support, NFS client support, and NFS root filesystem support - and I went on my way.
That wasn't the end of the trouble. While I could get the domU to use the NFS share as its root filesystem, it wasn't accessing it properly. The root user had no permissions to write to anything, so everything was broken. This is typical of an NFS share with the "root_squash" option enabled, but I specified that my share be exported with the opposite setting enabled ("no_root_squash"). No matter what I did, I couldn't find out why root squashing was happening. I could mount the share just fine from another machine, and root squashing wasn't happening there.
I decided to look at the differences in the mount parameters between my broken domU and the working system. There were a few differences, but the thing that was causing problems was "sec=null". That setting disables all authentication for the mount, and all access is mapped to the anonymous user specified on the NFS server.
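Comparing two long option strings by eye is error-prone, so here's a rough sketch of how that comparison could be scripted. The two option strings below are made-up placeholders, not the actual values from my machines; in practice you'd paste in the options column from /proc/mounts on each system. The function names are my own invention too.

```python
def parse_mount_options(option_string):
    """Turn 'rw,vers=3,sec=null' into {'rw': True, 'vers': '3', 'sec': 'null'}."""
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flags without '=' (rw, hard, nolock...) are stored as True.
        options[key] = value if sep else True
    return options


def diff_mount_options(left, right):
    """Return {option: (left_value, right_value)} for options that differ."""
    a = parse_mount_options(left)
    b = parse_mount_options(right)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Hypothetical option strings, for illustration only.
broken_domu = "rw,vers=3,rsize=32768,wsize=32768,hard,nolock,proto=udp,sec=null"
working_box = "rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,sec=sys"

for name, (bad, good) in diff_mount_options(broken_domu, working_box).items():
    print(f"{name}: domU={bad!r} working={good!r}")
```

Options that only appear on one side show up with None on the other, which makes something like a stray "sec=null" easy to spot.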
I had found my problem, but the solution eluded me. I tried every way I could think of to change the mount parameters, but nothing worked. Then I stumbled across this post to the Linux Kernel Mailing List. Apparently, something was broken in the NFS kernel code in the 2.6.18 release that has to do with properly identifying what NFS server version one is connecting to. CentOS uses the 2.6.18 kernel. I tried applying the patch described in the post, and voila! Everything works!
With everything working, I was able to play with a few other things. I have two physical networks in my Xen boxes, one public and one private. All domUs are connected to the public network on eth0, and the private network is connected on eth1. I want to mount the NFS shares on the private network, but the default Xen configuration directives only seem to allow mounting NFS roots via eth0. I got around this by specifying the IP configuration stuff in the "extra" directive instead of the ip, netmask, and gateway directives. Here's the relevant portion of the config file.
nfs_root="/xen/domains/test"
nfs_server="192.168.3.10"
root="/dev/nfs"
extra="ip=192.168.3.150:192.168.3.10:192.168.3.4:255.255.255.0::eth1:"
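That "ip=" string is easy to get wrong because the fields are positional. As I understand the kernel's nfsroot documentation, the classic layout is client IP, server IP, gateway, netmask, hostname, device, and autoconf method, separated by colons, with empty fields left blank. Here's a small sketch that builds and picks apart such a string; the helper names are mine, and it only covers that seven-field IPv4 form.

```python
# Field order of the classic seven-field kernel "ip=" parameter.
IP_FIELDS = (
    "client_ip", "server_ip", "gateway_ip",
    "netmask", "hostname", "device", "autoconf",
)


def build_ip_param(**fields):
    """Build an 'ip=...' kernel argument; omitted fields are left empty."""
    unknown = set(fields) - set(IP_FIELDS)
    if unknown:
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    return "ip=" + ":".join(fields.get(name, "") for name in IP_FIELDS)


def parse_ip_param(param):
    """Split an 'ip=...' kernel argument into a dict of named fields."""
    if not param.startswith("ip="):
        raise ValueError("expected a string starting with 'ip='")
    values = param[len("ip="):].split(":")
    if len(values) > len(IP_FIELDS):
        raise ValueError("too many fields for the seven-field form")
    # Trailing fields may be omitted entirely; pad them with empty strings.
    values += [""] * (len(IP_FIELDS) - len(values))
    return dict(zip(IP_FIELDS, values))


extra = build_ip_param(
    client_ip="192.168.3.150",
    server_ip="192.168.3.10",
    gateway_ip="192.168.3.4",
    netmask="255.255.255.0",
    device="eth1",  # bring the address up on the private interface
)
print(extra)
```

Since Xen domain config files are themselves Python, a helper like this could in principle live right in the config, though I've only used the hand-written string above.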
Now for more experimentation!