Mike Neir's Page
Viewing 6 posts tagged with 'raid'

Xen + AoE + drbd = New Redundant Hotness

Tuesday, July 22 2008, 2:17 AM

A few weeks ago, my buddy Justin posed an interesting problem, one that I've been pondering myself for some time. He's somewhat of a Xen zealot like myself, and is doing some Xen setups that are similar in construction to mine, with a central shared storage array and two or more dom0 machines where the child instances will live. The prospect of migrating domUs between dom0s is quite appealing to him, but he, like myself, realized a critical flaw in the setup. If the storage array fails or requires uptime-affecting maintenance of some sort, the whole setup grinds to a halt. That doesn't really fit the goals he and I are both after.

After a bit of thought, I looked to a project Justin had mentioned a few months back called drbd. Its designers deem it "network raid 1", and that's a pretty accurate description. It's essentially a system that mirrors data between two different machines either in an active-standby or active-active configuration. One of its primary goals is to provide storage as close to 100% of the time as is possible. Its usefulness would vary highly depending on the application. Having a normal file system on it shared between two machines could be nothing short of a nightmare, since neither server knows what the other is doing until changes are already written to the shared storage. A clustered file system would work well with it though. As I began to learn more about how it works, I realized it could potentially be a great solution for my predicament. Since either of the two machines could provide storage at any given time, it would have no problem fulfilling the near 100% uptime requirement.

What really makes the solution stand out to me isn't just drbd itself, but the combination of drbd and AoE. AoE is, by design, a connectionless protocol. When the kernel module is loaded, it does a device discovery to see what devices are available for its use, and listens thereafter for new devices. The information it learns is pretty much limited to a MAC address where the storage device is located and the vblade "addresses" within that device that are available. There's nothing within the protocol that outlaws multiple targets from advertising the same vblade "address", and it's up to the AoE initiator in the kernel module to choose where it's sending data. Because of this, you could have two linux vblade targets running on both "ends" of a drbd setup, and there'd be no conflicts whatsoever. The recommended setup in drbd is to consider a write operation as finished only when data has been written to disk on both "ends" of the drbd setup. Combine that with the fact that AoE will only send commands to one MAC address at a time, and it's pretty much guaranteed that both vblade targets will be connected to the same data at all times, even though they're on different machines. I can think of a scenario or two where data would be out of sync, but it would require that disk write operations be done in parallel, and I'm fairly certain that they aren't.
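To make the "finished only when both ends have it" rule concrete, here's a rough Python sketch of a synchronous mirrored write. It is a toy model, not drbd code: the `Replica` class and `mirrored_write` function are made-up names, and it only covers the case where both ends are connected.

```python
class Replica:
    """Hypothetical stand-in for one "end" of a mirrored pair (not a drbd API)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block number -> bytes stored on this end

    def write(self, block_no, data):
        self.blocks[block_no] = data

    def read(self, block_no):
        return self.blocks[block_no]


def mirrored_write(replicas, block_no, data):
    """Acknowledge the write only after every replica has stored the block."""
    for replica in replicas:
        replica.write(block_no, data)
    return True  # the caller sees "done" only at this point


# Machine names are placeholders.
primary = Replica("storage-a")
secondary = Replica("storage-b")

if mirrored_write([primary, secondary], 0, b"domU root block"):
    # Once the write is acknowledged, either end returns the same data,
    # so it doesn't matter which one a reader ends up talking to.
    assert primary.read(0) == secondary.read(0)
```

Because the acknowledgement comes last, a reader that switches from one end to the other after a completed write can't see stale data in this model.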

The fact that the same data is on both machines and that AoE allows for a quick and painless transfer between vblade targets is what makes this such a simple and effective solution for me. There may be a few seconds of lag while the AoE initiator realizes that the machine it was talking to has disappeared, but that will pass as soon as it does another device discovery and sees the other vblade target. This is perfectly acceptable in my usage scenarios thus far.
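The failover described above can be sketched as a small target table keyed by the vblade "address". This is a toy model in Python, not the real aoe kernel driver: the `Initiator` class, its methods, and the MAC addresses are all invented for illustration, and the way the real driver picks between several MACs is not modeled here.

```python
class Initiator:
    """Toy model of an AoE initiator's view of the network."""

    def __init__(self):
        self.targets = {}  # (shelf, slot) -> list of MACs advertising it
        self.active = {}   # (shelf, slot) -> the one MAC receiving commands

    def discover(self, advertisements):
        """Rebuild the table from (mac, shelf, slot) tuples seen on the wire."""
        self.targets = {}
        for mac, shelf, slot in advertisements:
            self.targets.setdefault((shelf, slot), []).append(mac)
        for addr, macs in self.targets.items():
            # Stay with the current target while it still answers,
            # otherwise switch to another machine advertising the address.
            if self.active.get(addr) not in macs:
                self.active[addr] = macs[0]
        for addr in list(self.active):
            if addr not in self.targets:
                del self.active[addr]  # nobody advertises this address now


initiator = Initiator()
both_up = [("00:16:3e:00:00:01", 0, 1), ("00:16:3e:00:00:02", 0, 1)]
initiator.discover(both_up)
before = initiator.active[(0, 1)]

# The first machine disappears; the next discovery only sees the second.
initiator.discover([("00:16:3e:00:00:02", 0, 1)])
after = initiator.active[(0, 1)]
print(before, "->", after)
```

The few seconds of lag correspond to the gap between the first target vanishing and the next `discover` call.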

I took the plunge a week or so ago and started converting my storage at home to use drbd. It was pretty simple to convert from my LVM-based setup, since all I had to do was create one additional LVM partition for every partition I wanted to sync between machines. These additional LVM partitions store the metadata that drbd uses to track changes and to keep things in sync. This configuration also allows me to revert back to using "naked" LVM partitions as my vblade storage targets if I decide I don't like drbd in the future. I used my MythTV recording backend as the second drbd server, since it has a lot of space for extra drives and is on pretty much all the time. I put in a 120GB drive, and let everything fly. Once the initial synchronization was complete, I did a few tests, and everything worked as intended. I could kill vblade targets on either machine, and after a few seconds, the initiators would look at the other machine and use it for storage. Success!
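For a rough idea of how big those extra metadata partitions need to be, here's a small Python estimate. The formula is an assumption on my part: it is the external metadata estimate as best recalled from the DRBD 8.x User's Guide (8 sectors per 2^18 data sectors, plus 72), so check it against the documentation for whatever version you run. The function name is made up.

```python
SECTOR_BYTES = 512


def drbd_metadata_sectors(data_bytes):
    """Estimate external metadata size in 512-byte sectors.

    Assumed formula: ceil(data_sectors / 2**18) * 8 + 72.
    """
    data_sectors = -(-data_bytes // SECTOR_BYTES)  # integer ceiling division
    return -(-data_sectors // 2**18) * 8 + 72


# Example: a 120GB data partition (decimal gigabytes, as drives are sold).
sectors = drbd_metadata_sectors(120 * 10**9)
print(f"{sectors} sectors, about {sectors * SECTOR_BYTES / 2**20:.1f} MiB")
```

If the estimate holds, the metadata partition is tiny compared to the data it tracks, so rounding it up generously costs next to nothing.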

As of this weekend I've also converted my setup at work to use the same general configuration. The primary storage consists of a big RAID array, with a secondary machine using a single drive as a backup. I figure that in most cases running in an active-active setup wouldn't be necessary, so I'm going to stick with active-standby, and only start the vblade targets on the secondary machine when I'm planning on a reboot or other maintenance event. I've also considered running in an active-off state (with periodic resyncs), so that there wouldn't be any performance hit from waiting for the second server to complete its writes. This would probably be a less desirable setup since the data could (and very likely would) be out of date if I were to suffer an unplanned outage such as a hardware failure. Nothing I run currently is terribly needy in terms of disk write performance, so I'm not terribly concerned at this point.

Quick Update

Friday, May 02 2008, 11:15 PM

Touching on a few things from my last post...

Moving the bed to the other bedroom seems to have helped. I've slept pretty well the past few days. This is a good thing.

I was finally able to get the firmware on my RAID card updated. It only took about 5 hours worth of time, 5 wasted CD-ROMs, and a lot of annoyance. The result? The RAID array rebuilt itself for two days (pretty damn long for an array of that size), and the card promptly kernel panicked upon reaching 100% on the rebuild process. At least it didn't start over again like it had in the past.

I think I might need professional help. This morning, I built a small cardboard baffle (?) to direct air out of a HVAC duct towards the servers in my closet. People more sane than myself would probably say things like "you should have fewer computers", "why do you have so many computers?", or even the dreaded "you're freakin' weird!" I, however, build cardboard ducting.

[cardboard baffle]

I also caught this guy outside my window the other morning. I'm surprised the picture came out so clear... the window I took the picture through is dirty as hell.

[a bluejay on a branch]

Bad Form HP... Bad Form Indeed

Tuesday, April 29 2008, 11:14 PM

So, I'm sitting here in my newly reconfigured den/office/computer room/whatever (more on that later) waiting for CD images to download. Multiple CD images. I'm trying to upgrade the firmware for a SATA RAID card I eBay'd a few weeks back to handle the new 2TB RAID array I was building. It seemed to work fine, but after some use, it showed a nasty habit. The firmware seems to be buggy somehow, as I receive messages in my server's dmesg output about the adapter kernel panicking. I found it strange that a RAID card could kernel panic, but it makes sense when you consider the fact that firmware is just software for a hardware device. The card seems to recover itself about 80% of the time, but that other 20% involves hardware hangups, and hard reboots. No fun... especially when this box is providing storage for multiple other machines.

Back to the firmware. I'm trying to download the latest firmware for this card, but HP isn't making it at all easy. Instead of putting a link to a specific file, or set of files, they just have links to their "Firmware Maintenance" CDs, which contain updated firmware for damn near everything they've unleashed upon the world in the past X years worth of product releases. Thing is, there's no documentation anywhere I can find about what each CD covers, what firmware versions are present, or anything like that.

Case in point - I downloaded the newest CD listed on the support page for the RAID card, and it didn't have any firmware for the card listed whatsoever. 650MB downloaded and a CD wasted, all for nothing. I tried the oldest CD they have listed, and sure enough, it has firmware on it for my card. I was left wondering if there was a newer firmware somewhere on one of those intermediate CDs though. So here I am, downloading more CD images, and wasting more CDs. Grand! What's really great is that if I want to actually install one of these firmware updates, I need to find three working floppy disks that I can use to actually perform the upgrade because the embedded boot process won't install things automatically. Why? Even though the server is an HP-branded server, it's not of the proper vintage, so the boot loader/installer thing won't work. What a pile of crap.

Now that I'm done venting about stupidity (while downloading yet another CD)...

I've been feeling pretty run down over the past month or so. I attribute it mostly to what I would consider poor sleep. I typically sleep for a good 7-8 hours, but I haven't woken up feeling rested in quite a while. I have one working theory as to why, and it involves the server I'm working on right now, strangely enough. The simple fact is that the thing is loud. It's buried in my walk-in closet with my other server gear, but it still had no issue polluting my bedroom (which is attached to said closet) with gratuitous amounts of noise. It seems like my poor sleep roughly coincides with the introduction of that machine into my closet, so I decided to shift things around a bit to see if it would help. I switched the roles of my two bedrooms - the larger bedroom became my office/computer room/underground lair, and the smaller room became the bedroom. It remains to be seen whether it will help much, but I liked this arrangement better when I used to have it. It also allows me to move the second cat litter box out of the bathroom, which is reason enough to move everything around in my opinion. They've never really been very good at keeping litter off the tile, which sucks. I'm not really a fan of walking on cat litter, especially when I'm fresh out of the shower and my feet are wet.

Complete Overkill

Tuesday, October 30 2007, 10:47 AM

This past Saturday I spent the whole day at work. Over twelve hours. The day was boring for the most part, tedious, and I even fell asleep once or twice at my desk. Not good you say? The thing is, I wasn't working. I was migrating data off my personal RAID array so that I could expand it to encompass the second RAID cabinet that I just finished acquiring disks for.

This whole saga goes back to the beginning of the summer. I was in the middle of a pretty large eBay binge, all the while buying tons of computer stuff I didn't really need. While I certainly didn't need the stuff, having it around has allowed me to further my knowledge in things like Xen and play with other cool technologies like ATA over Ethernet (AoE). Anyway, during my browsing, I came across a RAID cabinet completely loaded out with 18GB drives pretty much identical to one I had purchased earlier on. The price was pretty cheap, so I nabbed it.

The customary week passed while UPS ground transported the 100 pound-plus package from its origins on the west coast to its new home in Lansing. When it arrived, I eagerly opened up the box and inventoried its contents, only to find that I wasn't given all that I was promised. The RAID cabinet itself, the full complement of twelve 18GB drives, and the mounting rails were all there, but the two external SCSI cables were missing. I promptly got in touch with the seller, and he assured me that it was just an oversight and that he would send the cables right away. Right away turned into a couple of weeks. I finally received the delivery, only to find that he had sent the wrong cables. They would easily fit into the ports on the back of the RAID card in my server, but the other end had the wrong connector type, and wouldn't connect to the RAID enclosure. I got back in touch with the seller again, and he again promised to right the wrong, and at no charge. Well, those cables never arrived. I was ticked, but I got over it.

Over the next few months I replaced the 18GB disks with 36GB disks so that I could merge the second RAID cabinet into my current RAID array, which is comprised of 36GB disks. Used 36GB drives are relatively inexpensive on eBay so I was able to get the requisite number of disks without too much expense. I also ordered the cables that I was denied in my original purchase.

All the pieces were in place last week, so I decided to try getting everything set up. I got everything put together and connected this past Thursday night, and started the process of finalizing things. The RAID card I'm using in the machine (a Compaq/HP Smart Array 5304-128) supports live expansion of an array onto new disks, so I fired up that process from the command line array management utility. I was mildly surprised that I could only extend onto seven of the twelve disks in the second RAID cabinet. I figured that it was just something in the way the expansion algorithm worked, and that I'd be able to expand onto the other disks when the initial expansion finished.

Rearranging 500GB of data on disks is not a quick process, so I had to wait a good twelve hours before I would discover that I was not able to expand the array onto those remaining five disks. I searched around for possible reasons why I was having difficulty and found nothing concrete. I did come across a blurb somewhere that mentioned RAID6 allowing for more disks than RAID5 in an array, which didn't really seem correct to me, but I decided to try converting the array to RAID6 anyway. Another twelve hours later, I discovered that didn't work either.

I decided to completely redo the RAID array from scratch, which leads us to last Saturday. I was up pretty early, so I made myself a big breakfast and headed into work to do the deed. I got started copying the data off the RAID array, which was no quick process. It takes a while to copy over a hundred gigabytes of data. During this time, I played some games on the laptop, walked around the datacenter, chatted with people, and took a few naps. There's a surprisingly comfortable position I've discovered on my desk where I can just zonk right out... when I'm not working of course! Anyway, once the data was copied off, I recreated the RAID array and was glad to see that I could use all the disks to their full capacity.

I began the process of copying data back onto the array, and settled in for another nap. I awoke a short time later to find that one of the drives in the array had croaked. It turned out to be one of the disks in the newest group I had purchased, and not really tested all that thoroughly. A disk failure sounds bad, but in RAID6, up to two disks in the array can fail without any data loss. So, in this case, it was just an annoyance. The RAID card detected the failure and immediately started working to rebuild the failed disk's data on one of the hot spares I have configured. This process slowed down the data transfer pretty substantially, so I was there for a lot longer than I wanted to be. I replaced the failed disk with one of my leftover disks, and the array was happy again.

So, now I have a nice RAID6 array composed of 26 36GB SCSI drives and two hot spares. It weighs in at around 860GB total, and takes up 11U worth of space on the racks at work. I find it pretty funny that there are now single SATA drives that can store more data than that entire array, but they are nowhere near as cool. I have the blinky light factor on my side, as evidenced here...
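The capacity figure is easy to sanity-check. Here's a minimal Python sketch, assuming the usual rule of thumb that a parity RAID set gives up one drive's worth of space per parity drive (two for RAID6) and that hot spares don't count toward capacity. The function name is made up, and controller overhead is ignored.

```python
def usable_gb(drives, drive_gb, parity_drives):
    """Usable capacity of a parity RAID set, ignoring spares and overhead."""
    if drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (drives - parity_drives) * drive_gb


# 26 member drives of 36GB each in RAID6 (two drives' worth of parity).
print(usable_gb(26, 36, parity_drives=2))  # → 864
```

That lines up with the "around 860GB" quoted above.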

[RAID array as of 2007-10-27]

Xen + AoE = New Hotness

Thursday, June 14 2007, 8:08 AM

In my continued experimentation with hot migration of Xen environments, I think I've found a pretty awesome solution. It involves a system called ATA over Ethernet (AoE). This system transmits ATA commands over ethernet, so it allows for a remote disk to be treated like local block storage. The system was originally designed by a company called Coraid for use with their own proprietary disk arrays, but they produced a piece of software that replicates the same functionality on a normal linux machine.

I was doing experimentation with using NFS root filesystems, but there were a few things I didn't like about it. First off, creating the kernel was a pain. With all of the effort I mentioned in my previous post on the subject, keeping an updated kernel would be a total pain if you were using CentOS 5 like I am. Second, the kernel didn't seem to perform any caching of the NFS filesystems, so there was a large amount of traffic flowing over the network from all of the filesystem reads that the Xen environments were doing. Third, all of the root filesystem reads/writes were visible to the Xen instances, so their bandwidth counters (and their associated graphs in my Cacti system) were skewed by a large amount.

These issues don't seem to occur with AoE. The filesystems are imported on the host, so the stock CentOS Xen kernel doesn't have to be modified in any way. This also renders the network traffic required in maintaining the filesystems invisible to the Xen domains. The filesystem acts as a normal block device, so it is cached like a normal local disk is cached.

That's not to say there weren't issues. At first, the vblade daemon (the linux 'server' component of the AoE system) seemed pretty unstable. It seemed to randomly lock up, causing all of my Xen domains to crash, and forcing a reboot of the host server. I think it was just the way I was using it though. I was running the vblade program and backgrounding it, instead of using the vbladed script that was provided. I think it was locking things up when I disconnected the terminal in which I started the vblade instances. When the controlling PTY died, it caused the vblade instances to die in a bad way due to a lack of standard input and output channels. The vbladed script controls all of the input and output paths, so there's no worry if the terminal disconnects. Since I've started using vbladed, about three weeks ago, I haven't had a single failure.
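The general idea of cutting a long-running process loose from the terminal can be sketched in Python. This is not the vbladed script, just an illustration of the same technique: give the child its own session and point its standard input and output somewhere that won't vanish. The `spawn_detached` helper is a made-up name, and the vblade arguments in the comment are placeholder values.

```python
import subprocess
import sys


def spawn_detached(argv):
    """Start argv in its own session with no ties to the calling terminal."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,   # never reads from the terminal
        stdout=subprocess.DEVNULL,  # output is simply discarded here
        stderr=subprocess.DEVNULL,
        start_new_session=True,     # child calls setsid(); POSIX only
    )


# Harmless stand-in child so the sketch runs anywhere. A real call might pass
# something like ["vblade", "0", "1", "eth1", "/dev/vg0/xen-root"]
# (shelf, slot, interface, device; all four values are made up).
proc = spawn_detached([sys.executable, "-c", "pass"])
proc.wait()
```

With its own session and no terminal file descriptors, the child has nothing left to trip over when the login that started it goes away.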

I'm currently running vbladed against the LVM partitions I used with my NFS root filesystems. Off the bat, I thought this would come up a little short because I didn't have a swap partition available to the Xen domains. Then I remembered that I could use a regular flat file as swap space, so the problem went away.

Since the vblade server allows you to export a whole block device, be it a whole disk, a single partition, an LVM partition, or a whole RAID array, it opens up some interesting possibilities. On the remote system, you can access the exported block device as if it were a disk, partitioning it as you see fit, while on the system exporting, it could be one of many LVM partitions. This allows for the possibility of creating a "mini hard drive" for each Xen instance, each with its own root filesystem, swap space, and whatever else is deemed necessary. I haven't implemented this because I want to be able to use my LVM partitions with NFS if stability becomes an issue, but it would be a pretty neat setup.

Hot Migration Action

Tuesday, May 15 2007, 6:26 AM

As described in my last few posts, I've recently acquired a good amount of new server hardware. Well, everything is in my possession now except a few sticks of RAM, and it's all set up at work. I ended up picking up the RAID enclosure I mentioned earlier, along with disks to fill it. It ended up being quite a bargain, with the enclosure, drive trays, and an external SCSI cable only costing around $50 plus shipping. Here's all the new gear mounted in a rack at work... My stuff is the white stuff in a sea of black servers.

[servers]

I've got the RAID enclosure connected to the dual P3 1.0GHz machine I bought (furthest away on the bottom), and combined, there's 18 drive bays available to the SCSI system. I've got fifteen 36GB drives (plus one hot spare) in a RAID5 storage array and two 18GB drives in RAID1 for the OS installation. The RAID5 array weighs in at about 500GB, so I have plenty of room to keep stuff that I don't want to lose.
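The numbers in that paragraph can be checked with a few lines of Python. This assumes the usual rules of thumb (RAID5 loses one drive's worth of space to parity, a two-drive RAID1 mirror holds one drive's worth, hot spares add nothing) and ignores controller overhead.

```python
# Figures from the paragraph above; all sizes in GB.
BAYS = 18
raid5_members, hot_spares, raid1_members = 15, 1, 2

# Every bay is accounted for.
assert raid5_members + hot_spares + raid1_members == BAYS

raid5_usable = (raid5_members - 1) * 36  # one drive's worth goes to parity
raid1_usable = 18                        # a two-way mirror holds one drive's worth

print(raid5_usable, raid1_usable)  # → 504 18
```

So "about 500GB" for the RAID5 array is right on the money.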

I'm currently seeing how well my Xen domains function with NFS root filesystems. So far it looks pretty good. I've got the domains that host my web site (among other things) and the mysql domain running off the RAID5 array via NFS, and I haven't noticed any slowdowns whatsoever. The only unexpected thing I've come across is a few weird incompatibilities with the Gentoo init scripts, specifically when it tries to bring up networking devices. It just hangs up when trying to initialize eth1, which is the interface that the NFS root filesystem is accessed through. My firewall script also kills things, but I should be able to fix that.

Having things running over NFS allows for live migration of running domains. I tried it out a few hours ago, and it's surprisingly painless, given that the appropriate functionality is enabled in the Xen daemon. One command sends a running domain between physical Xen hosts, which is pretty damned neat. I can see this being tremendously useful in a high-availability sort of environment. If a host machine needs maintenance, you can simply transfer the running child domain to another host, do your business, and transfer it back with only a fraction of a second of downtime.
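For reference, the one command in question is `xm migrate` with its `--live` flag. Here's a small Python helper that just builds that command line. The domain and host names are placeholders, the helper is a made-up name, and nothing is actually executed, since that would need a working pair of Xen hosts.

```python
import shlex


def migrate_command(domain, target_host, live=True):
    """Build the argv for moving a domU to another dom0 with xm."""
    argv = ["xm", "migrate"]
    if live:
        argv.append("--live")  # keep the domain running during the move
    argv += [domain, target_host]
    return argv


# "web01" and "dom0-b" are hypothetical names.
print(shlex.join(migrate_command("web01", "dom0-b")))
```

Moving the domain back after maintenance would be the same command run on the second host with the first host as the target.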
