Monday, June 23, 2008

SAN Nightmare, Part 4

Note: This is part 4 of an 8-part series. Read them in order; it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 4: Recovery

Wednesday, May 21

Restore operations have been ongoing since Saturday night, and the last batch of lab restores completed by the end of the day. A full week of downtime for some labs... and their files are being restored from the night of May 8. Ouch. The IRT and Public volumes are still offline. Restore operations at this point have taken a back seat to making sure the backup operations run properly.

Friday, May 23

After working with Apple over the course of the week, it was determined that a software patch was necessary to prevent recurring issues. In addition to the catastrophic failure of our "san1" volume, we had been experiencing problems throughout the week with our other SAN volumes. Specifically, the fsm processes would crash on a very regular basis, usually at least once a day. These crashes required frequent server reboots. We were provided with a patch containing a fix for the known bug we were experiencing, which I installed in the evening while the server was offline for scheduled maintenance.

Almost immediately after installing this patch, I confirmed that the previous problem (random fsm crashes) seemed to be fixed... but that a new, much more serious problem had been introduced: random segmentation faults that corrupted the entire operating system. Oops. After running a few more diagnostics, we reverted the software to the original, less buggy version and went from there.

Tuesday, May 27

The long holiday weekend gave me a chance to get the last volumes restored from tape and back online. At this point, I finally got access to my files again -- I had been without them for 12 days, and work was piling up. By noon, another one of our SAN volumes (san5) had become very unstable -- three crashes in under an hour. Each crash hung the file service processes as well -- file servers get angry when the disks their hosted files live on are disconnected without warning. We had to take the four labs living on this volume offline for the remainder of the day. By 9:00 PM, all data on the san5 volume had been moved to another Seagate FreeAgent Pro disk.
