Monday, June 23, 2008

SAN Nightmare, Part 8

Note: This is part 8 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 8: Conclusions

In summary, xSan 2.0 sucks. Here's my general list of complaints:

- Undocumented steps in the upgrade process from 1.4.2 to 2.0 cause confusion and panic when users can't get their volumes to mount properly.

- Upgrade process introduced errors on one of our volumes that led to its eventual failure.

- Under 2.0, fsm process crashes randomly and far too often when folders on SAN volumes are re-shared over AFP/SMB and/or backed up with Retrospect.

- Under 2.1, the fsm process segfaults in much the same way as the 2.0 crashes. This is easily reproduced by setting ACLs on an AFP/SMB shared volume and propagating permissions to all folders/subfolders under the top level of the share (a rough sketch of this follows the list). Every time I try this, it crashes within 3 minutes.

- Under all versions, you cannot copy .mpkg files to an xSan volume over AFP. The volume crashes.

- Some programs do not allow you to open files directly on the server and edit them. Notable examples are EndNote and several Adobe apps. Instead, you have to copy the files to a local disk, edit them, and then copy them back to the server. This is annoying for users who keep their files on the server for safekeeping.

- Once an xSan volume crashes or becomes unstable, a reboot is often required to clear the memory and start fresh. If a volume was unmounted uncleanly, the OS still thinks files on it are open and tries to close them before restarting; since the volume is no longer mounted, it can't. This causes a hang that prevents the system from rebooting or shutting down gracefully, so a forced reboot is required. Forcing a power cycle through the rack PDU works quite well, but it's not good for the server.

- fsm crashes typically force reboots of the metadata controller and/or the client hosting them. When the client is the file server, this causes issues for connected clients. When the metadata controller is affected, all other volumes are forced to failover while the controller reboots.

- Retrospect takes an incredibly long time to scan volumes for files and to determine whether files have been changed or not. Similarly, the actual backups of files themselves are slow. This seems to be the case no matter how fast your metadata controller is, but is significantly more pronounced when using older/slower computers as the metadata controller.

- Retrospect is unable to define sub-volumes of an xSan volume as backup targets because of the way the filesystem handles directory ID information. This forces Retrospect to scan the entire volume for every backup. On a 1.6 TB volume with 1 TB of used space and 500,000 files, the scan routinely takes up to 20 hours. On a normal HFS+ volume, this process takes mere minutes. The problem is compounded by the 4,000,000 file "limit" for Retrospect backup sets. Files are often marked as changed when they weren't and get re-backed up; that, combined with normal change/modify operations, means a backup set can easily approach the 4,000,000 file limit over the course of its normal incremental backups before tape rotation.

- File copy and general file operations that require access to filesystem metadata are noticeably slower under xSan 2.0 compared to 1.4.2.

- The xSan Admin GUI for 2.0 is completely different from 1.4.2, and takes some re-learning to get used to. In version 2.0 of the GUI, it is also impossible to change a computer's role in the SAN from a Controller to a Client or vice-versa -- whatever role a computer is added to the SAN with is the role it keeps. I hear they fixed this in 2.1, but still... this is a very common thing for people to do and it somehow got overlooked.

- If you open the xSan Admin GUI on more than one computer, you occasionally get differing/conflicting information. This is most notable in the actual name of the SAN (inconsequential), but it also shows up in places that should never report false information -- like where it tells you which metadata controller is currently controlling a specific volume. The cvadmin command-line utility is so much better for most tasks, it's not even funny.
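
For reference, the segfault repro mentioned in the list boils down to a recursive ACL propagation. Here's a minimal Python sketch of a command-line equivalent -- the share path and ACL entry are stand-in placeholders, not the real values from our setup:

```python
#!/usr/bin/env python
# Rough repro sketch: set an ACL entry and "propagate" it to every
# folder and file under the top level of a share. SHARE_ROOT and ACE
# are placeholders, not the real values from our SAN.
import os
import subprocess

SHARE_ROOT = "/Volumes/san1/SomeShare"            # placeholder share point
ACE = "group:staff allow readattr,readsecurity"   # placeholder ACL entry

def propagate_acl(root=SHARE_ROOT, ace=ACE):
    """Apply the same ACL entry to the share root and everything beneath it."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Hit the directory itself, then every file in it.
        targets = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for target in targets:
            # chmod +a is the stock Mac OS X way to add an ACL entry
            subprocess.call(["chmod", "+a", ace, target])

if __name__ == "__main__":
    propagate_acl()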

SAN Nightmare, Part 7

Note: This is part 7 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 7: Eliminating the SAN

Tuesday, June 17

A plan for migration off of xSan and back to regularly formatted disks is born in the weekly IRT meeting. We decide to do the entire thing at once, over the weekend... and to split up our files onto two physical servers instead of one. The end result should be two file servers, each with approximately half of the data and half of the total users... and a total fill ratio of about 36% across all disks. Since we couldn't get an exact 50-50 split, we opted to give a slightly higher number of users to the computer that was faster and had more disk space.

Saturday, June 21

The migration began at 8:00 PM on Friday night. I also received my very own corporate credit card on Friday. During the course of the day Saturday, files were copied to the new disks. I spent as much time as I could at home, sitting out on my patio reading a book and getting tan. By some stroke of random, the temperature in La Jolla got up to 97 around 3pm. I got a little bit pink, but just the proper amount. It's already turned into a nice tan.

Just before bed, my phone starts chirping with SMS messages from one of our drive arrays. It appears as though Drive "Utica-6" has developed read/write errors and is no longer reliable. Thankfully, this is a RAID5 system... the data is protected and still intact. I go to bed and deal with it in the morning.

Sunday, June 22

Overnight, Utica-6 produced many, many more errors. It was time to replace it. I had exactly one spare drive on hand, which was used for this purpose. Time to put the new corporate card to use -- we needed more spare drives on hand. Of course, the particular variant of the Hitachi DeskStar 7K500 that I needed for a matched set is no longer manufactured... which posed a bit of a problem for finding an identical replacement. After a couple hours scouring Google and the internet at large, I was able to find ONE of these drives still new and unopened... in a warehouse in Canada. I also found a site that sells refurbished drives of this exact model. I ended up buying the single new drive along with two refurbs to replenish the stock of spares, and I bought some new tape for our P-Touch labeler while I was at it, because I ran out over the weekend.

Monday, June 23

By Monday evening, we've gone one whole day without a single server or disk crash. This should not be big news, but given the recent string of events I'd say it's pretty impressive. Things seem stable, and for the most part speedy. The Utica array is still rebuilding and conditioning itself after the failure of drive 6, so files on that particular disk are a tiny bit slower than normal. Other than that, everything is fine. Our former SAN was made up of one file server, one metadata controller, and two dedicated backup target computers. We have since unwound that and turned it into three independent file servers. The fourth computer is four years old and has a bad FireWire controller, so I think its days of usefulness are over. Later this week, I plan to strip it for parts and upgrade our admin file server.

SAN Nightmare, Part 6

Note: This is part 6 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 6: More Suffering

Monday, June 9

My boss announces that we are no longer allowed to use his corporate credit card to make purchases, because his assistant has transferred to another department and will no longer be reconciling the bill for him. Later this afternoon, someone decides to copy 200+ GB to their lab volume, which is hosted on one of the external drives. The drive fills to 100% capacity, and suddenly we are in need of an additional FreeAgent Pro. This minor crisis is the justification for having Accounting assign me my own corporate credit card.

Thursday, June 12

The 'san4' volume starts crashing repeatedly mid-day, much in the same way 'san5' did on May 27. This particular volume was the exclusive host of the file server data for the company we share the building with, and is therefore a critical part of our network. It becomes necessary for us to take this volume offline and move the data on it to a new drive as soon as possible. With no more FreeAgent Pro drives available, I had to use a lesser FireWire drive. The lack of eSATA slowed things down considerably. Meanwhile, we placed a rush order for another FreeAgent Pro drive, another eSATA card, and two internal hard drives for the server that we're turning into a dedicated server for this group.

Friday, June 13

The data copy was complete by about 1:00 PM. We had decided to take this chance to move this particular set of files to a dedicated server to avoid future problems like this, and because it makes things much cleaner for us from an administration standpoint. I began the migration to the new server at 6PM.

At 4:00 PM, the same instability issues we previously saw on the san5 and san4 volumes spread to san6. We are forced to take two more labs offline to move them to the new FreeAgent Pro drive purchased the day before.

Both copy operations are successful and wrap up faster than anticipated, leaving some spare time over the weekend to come up with a more comprehensive plan for final removal of xSan from our network.

SAN Nightmare, Part 5

Note: This is part 5 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 5: Relative Stability

Thursday, May 29

A pre-planned trip to Sacramento couldn't have come at a better time. I thought I was going to be unable to go, but as it turns out, things are pretty stable for the two days before I leave. For the most part, they remain that way while I am gone. I think there was one instance where the server had to be rebooted.

Over the course of the weekend, Apple continues to work on finding a way to recover files from the original "san1" volume that crashed. They're still convinced there may be hope of recovering those files... which would go a long way toward making people not hate me so much. My understanding is that there was a fair amount of priceless research that was lost from that one week gap in the backups.

Thursday, June 5

Much of the early portion of the week is consumed by maintenance requests: finding specific files and folders that didn't restore properly, fixing file permission issues, etc. Just before noon, Apple declares our san1 volume to be officially unrecoverable. At this point, we are free to re-format the disks and start moving data back to them. We opt to use the space to do a little additional testing, first.

SAN Nightmare, Part 4

Note: This is part 4 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 4: Recovery

Wednesday, May 21

Restore operations have been ongoing since Saturday night, and the last batch of lab restores completed by the end of the day. A full week of downtime for some labs... and their files are being restored from the night of May 8. Ouch. The IRT and Public volumes are still offline. Restore operations at this point have taken a back seat to making sure the backup operations run properly.

Friday, May 23

After working with Apple over the course of the week, it was determined that a software patch was necessary to prevent recurring issues. In addition to the catastrophic failure of our "san1" volume, we had been experiencing other problems throughout the week with our other SAN volumes. Specifically, they would crash on a very regular basis, usually at least once a day, and each crash required a server reboot to clean up. We were provided with a patch containing a fix for the known bug we were experiencing, which I installed in the evening while the server was offline for scheduled maintenance.

Almost immediately after installing this patch, I confirmed that the previous problem we had (random fsm crashes) seemed to be fixed... but that a new, much more serious problem had been introduced: random segmentation faults that corrupted the entire operating system. Oops. After running a few more diagnostics, we reverted the software back to the original less-buggy version and went from there.

Tuesday, May 27

The long holiday weekend gave me a chance to get the last volumes restored from tapes and back online. At this point, I finally get access to my files again -- I had been without them for 12 days, and work was piling up. By noon, another one of our SAN volumes (san5) had become very unstable -- three crashes in under an hour. Each time it crashed, it hung the file service processes as well; file servers get angry when you disconnect the disks their hosted files live on without warning. We had to take the four labs living on this volume offline for the remainder of the day. By 9:00 PM, all data on the san5 volume had been moved to another Seagate FreeAgent Pro disk.

SAN Nightmare, Part 3

Note: This is part 3 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 3: Disaster Strikes

Thursday, May 15, 2008

At 10:00 AM, the "san1" volume -- our largest, with 5 TB of total capacity and the data for 11 of our 17 research divisions (plus IT) -- crashes and refuses to re-mount on either server. After a series of reboots and continued failures attempting to get the thing to mount, another call to Apple was in order. By noon, Apple determines that we should probably start restoring the data from backup tapes, since the prognosis does not look good. I start running some diagnostic tools for Apple while Michael loads the most current tapes in the tape library to begin the restore operation. At this point, we discover two things: (1) the "san1" volume is the only one that has not backed up properly in the last week (since the upgrade to xSan 2.0), because the backup process kept crashing mid-run, and (2) the robotic arm on our tape loader has picked this exact moment, of all possible times, to fail. So: we have a volume that won't mount because it's corrupt, backups of that volume that are one week stale, and no way to read the stale tapes anyway because the robotics in the tape loader have failed. The first order of business, then, is getting the tape library working so we can read the tapes.
We put in a call to Quantum support about getting the library repaired and were told that since the device is out of warranty, they won't even talk to us until we purchase a support contract. The support contract department is out of the office for the day, so we should expect a call back tomorrow.

Feeling completely helpless, I decide to go home by 4:00 PM to get some dinner and some rest, because I have to be back in the office at 8:45 PM to start taking equipment offline for a large planned power outage. The impeccable timing of this disaster plus the planned outage kept me at the office until 2:45am.

Friday, May 16, 2008

By 8:00 AM, I'm back at the office running on a little less than 4 hours' sleep. Efforts to contact the folks at Quantum are unsuccessful all morning, so I leave to run other errands while Michael continues to try to reach the contract folks. I head to Fry's to buy two 1TB Seagate FreeAgent Pro drives so we have somewhere to put the data once we start restoring it. (Side note: the FreeAgent Pro drives are eSATA/USB2/1394 and are awesome. I highly recommend them.) On the way back, a trip to Costco was in order to pick up beer and desserts for the IRT-hosted Happy Hour scheduled for 4:00 PM. We had already booked and paid for the catering, so we couldn't cancel the thing. Another case of impeccable timing.

Shortly after returning to the office, we finally manage to get in touch with Quantum regarding the support contract. We end up paying a bit over $3000 for the "Gold" maintenance contract, which entitles us to 24/7 on-site support. They diagnose the problem as a bad picker hand and schedule a courier to deliver the part by 4pm, and a technician to install it by 6pm. Convenient: the IRT Happy Hour ran from 4-6pm.

The Quantum tech shows up, installs the new picker hand incorrectly, and continues to get the same error message as before on the Library. Then, he tweaks something and manages to run over the umbilical cord that connects the hand/picker to the rest of the library's electronics at about 8:30pm. Since it shot sparks, Quantum decided to send another umbilical out to us. I've been in the office for quite a while at this point, so I send the service tech home with instructions to come back in the morning. Quantum delivers the part to my apartment at about 11pm, just as I'm finishing up watching the season finale of The Office on my DVR.

Saturday, May 17, 2008

9:00 AM: Arrive at Office.
10:00 AM: Replacement umbilical cord installed. Same error.
10:30 AM: Service determines the problem is in the (already replaced) picker.
10:35 AM: Closest picker is in Irvine; it is being sent by courier to arrive at 2pm.
11:00 AM: Lunch.
2:30 PM: Picker arrives late due to bad traffic in Oceanside/Del Mar.
3:30 PM: Same error. All field-serviceable parts have been serviced. Quantum replacing entire chassis.
3:45 PM: Closest chassis is in downtown LA. Estimated arrival: 8:00 PM.
3:46 PM: I send the service tech home. I don't trust him anymore. I'll swap the chassis myself.
7:01 PM: Courier must have broken all sorts of speed laws to get the chassis to us by 7pm.
10:30 PM: Restore operations begin. I go home.

Sunday, May 18, 2008
9:00 AM: Service tech returns to pick up the bad unit. The new one is working fine, thanks.

SAN Nightmare, Part 2

Note: This is part 2 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 2: Imminent Failure

Monday, May 12, 2008

In the morning, our LDAP server at work manages to get its internal account database corrupted. This server issue has absolutely no bearing on the xSan project other than its timing -- I ended up spending most of the day on Monday running around trying to fix other systems that were affected by the LDAP outage instead of paying attention to the backup scripts I'd started on Saturday afternoon.

Tuesday, May 13, 2008

Our weekly IRT meeting focused mainly on the LDAP failure from Monday, and how to better communicate things like downtime in the future. A brief wrap-up of the SAN migration over the weekend was presented, with the verdict that things looked good so far. After the meeting, I checked on the backups and noticed the first problem: instead of shortening the backup window, xSan seemed to be lengthening it dramatically. We had six SAN volumes, each of which was supposed to back up nightly. We had one dedicated computer to run backups and the ability to add a second if necessary, which would give us two simultaneous backup processes at most. The "san5" volume was taking about 27 hours to run an incremental backup... on its own. As a result, our other volumes were being skipped because the process was taking so long. I made a few changes and set up the "san1" volume to start a backup operation.

Later in the afternoon, we start having odd problems with some of our share points on the server. It turns out that people in specific labs aren't able to connect to their files, because the volume their data resides on has unmounted itself from the file server. The odd thing was that I couldn't get it to re-mount on that computer -- but it would mount on the "spare" server just fine. Over the course of the afternoon, I re-configured all the server share points onto the new hardware and moved the DNS records over. This allowed everyone to connect to the new box using the same server names. Everything seemed happy.

SAN Nightmare, Part 1

Note: This is part 1 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 1: The Upgrade

Friday, May 9, 2008

On the evening of Friday, May 9, I scheduled downtime on the file server to upgrade our version of xSan from 1.4.2 to 2.0. I had hoped the new version would fix some random, nagging problems we'd been having with the software such as occasional unannounced server reboots and problems with certain types of files. The random reboot thing was happening more or less on a weekly basis and seemed to coincide with some larger backup operations we were doing. The upgrade was also (hopefully) going to help our backup server more effectively back up the data on the SAN by improving read/copy speeds.

I downloaded the migration guides and read them over prior to starting the upgrade. The guide mentioned the need to wait a period of time (sometimes a few hours) for the volumes to update their metadata to the new 2.0 format before they would be available. I installed the software, noticed the volumes were showing up in the GUI admin tool as available after a few minutes, and didn't think much of it until I went to try to start/mount some of them and began to receive errors. After freezing up the GUI several times by trying to start a volume, and having to force-reboot the server each time, I decided to call Apple.

The support rep on the phone mentioned something that was not noted anywhere in the migration guide at all: you have to just let the upgrade run its course before trying to start the volumes. Problem: there is no progress bar that tells you (a) whether the update has started, (b) whether it is running, or (c) when it is done. The update happens silently in the background and can take "hours" depending on what exactly you're storing there. To determine whether the RPL update is done, you have to go hunting through the system logs for a very specific (undocumented) file and search for the string that denotes entries related to the upgrade. Thanks for documenting that, Apple.
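
For the curious, the check Apple walked me through amounts to grepping a log file for a magic string. Here's a minimal Python sketch of that idea only -- the log path and marker string below are made-up placeholders, because the real file and string are precisely the undocumented part I'm complaining about:

```python
#!/usr/bin/env python
# Sketch: poll a log file until a completion marker shows up.
# LOG_PATH and DONE_MARKER are hypothetical placeholders -- the real
# (undocumented) file and string are the ones Apple had me hunt for.
import time

LOG_PATH = "/Library/Logs/Xsan/example.log"   # placeholder, not the real file
DONE_MARKER = "RPL update complete"           # placeholder, not the real string

def rpl_update_done(path=LOG_PATH, marker=DONE_MARKER):
    """Return True once the marker string appears anywhere in the log."""
    try:
        with open(path) as f:
            return any(marker in line for line in f)
    except IOError:
        return False  # log doesn't exist yet; treat as "not done"

if __name__ == "__main__":
    while not rpl_update_done():
        print("RPL update not confirmed in the log yet; checking again in 5 minutes...")
        time.sleep(300)
    print("Found the marker -- should be safe to start the volumes now.")
```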

Saturday, May 10, 2008

After the support call with Apple from the previous night, I decided to just let the "RPL update" run its course overnight, and come back in the morning to see how things looked. With all the RPL updates done and all our volumes mounting properly, things seemed to be in good shape. I brought the server back online at about 2pm, re-configured the backup routines, and told them to start backing up the server.

SAN Nightmare, Part 0

Note: This is part 0 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 0: Introduction

This series of entries chronicles select events that happened between May 9 and June 23, 2008. The purpose of this series is primarily so I can remember how bad the last two months have really been. In the process, maybe someone will stumble across this and decide that xSan has the potential to develop as much of a "distaste for your environment" (Apple's words) as it did for mine.

A bit of background: we had been running the xSan 1.4.2 software since late November 2007. Due to some issues we had with it (explained later on), we decided to upgrade to the new version in hopes of a fix. When we originally implemented the xSan solution, we did so because we were intrigued by the idea of allowing multiple servers to share a single large pool of disk space. This would, in theory, allow us to do things like share a single public folder across several servers, or move groups of people from one server to another for load-balancing reasons without having to move their data. Furthermore, it allowed us to set up a model in which one computer on the SAN was dedicated specifically to handling all of Retrospect's backup requests. Structuring backups this way freed up a considerable amount of CPU time on the file server itself to do things such as serve files in a timely fashion.

Sunday, June 22, 2008

New Apartment, Part 3

As promised in the previous post, pictures have been posted of the apartment with all the furniture where it now lives. Additionally, photos and art have been installed on (most of) the walls. Things are looking better, but still need some work.

Also included are a few quick snapshots of the garden I planted last weekend. Enjoy!

Link to Photo Gallery

Saturday, June 21, 2008

New Apartment - Furniture

After a very long delay, I have finally gotten around to uploading pictures of my new apartment with actual, real furniture in it. Between moving, unpacking, decorating, a crazy two months at work, and trying to actually be social, I haven't had time to bother getting these things uploaded until now.

Most of this gallery consists of photos taken during the "unpacking and settling in" phase, so a lot of the furniture is no longer arranged as shown in the photos. Another set of pictures will follow with the "final" furniture layout.

Click Here to link to the photo gallery.