Monday, June 23, 2008

SAN Nightmare, Part 3

Note: This is part 3 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 3: Disaster Strikes

Thursday, May 15, 2008

At 10:00 AM, the "san1" volume, our largest at 5TB total capacity and home to the data for 11 of our 17 research divisions (plus IT), crashes and refuses to re-mount on either server. After a series of reboots and continued failed attempts to get the thing to mount, another call to Apple is in order. By noon, Apple determines that we should probably start restoring the data from backup tapes, since the prognosis does not look good. I start running some diagnostic tools for Apple while Michael loads the most current tapes into the tape library to begin the restore operation. At this time, we discover two things: (1) the "san1" volume is the only one that has not backed up properly in the last week (since the upgrade to Xsan 2.0) because the backup process crashed while running, and (2) the robotic arm on our tape loader has picked this exact moment, of all possible times, to fail. So, we have a drive that won't mount because it's corrupt, backups of that drive that are one week stale, and no way to read the stale tapes anyway because the robotics on the tape loader have failed. The first order of business, then, is getting the tape library working so we can read the tapes.
We put in a call to Quantum support about getting the library repaired and are told that since the device is out of warranty, they won't even talk to us until we purchase a support contract. The support contract department is out of the office for the day, so we should expect a call back tomorrow.

Feeling completely helpless, I decide to go home by 4:00 PM to get some dinner and some rest, because I have to be back in the office at 8:45 PM to start taking equipment offline for a large planned power outage. The impeccable timing of this disaster plus the planned outage keeps me at the office until 2:45 AM.

Friday, May 16, 2008

By 8:00 AM, I'm back at the office running on a little less than 4 hours' sleep. Efforts to contact the folks at Quantum are unsuccessful all morning, so I leave to run other errands while Michael continues to try to talk to the contract folks. I head to Fry's to buy two 1TB Seagate FreeAgent Pro drives so we have somewhere to put the data once we start restoring it. (Side note: The FreeAgent Pro drives are eSATA/USB2/1394 and are awesome. I highly recommend them.) On the way back, a trip to Costco is also in order to pick up beer and desserts for the IRT-hosted Happy Hour scheduled for 4:00 PM. We had already booked and paid for the catering, so we couldn't cancel the thing. Another case of impeccable timing.

Shortly after returning to the office, we finally manage to get in touch with Quantum regarding the support contract. We end up paying a bit over $3000 for the "Gold" maintenance contract, which entitles us to 24/7 on-site support. They diagnose the problem as a bad picker hand and schedule a courier to deliver the part by 4pm, and a technician to install the part by 6pm. Convenient: The IRT Happy Hour runs from 4-6pm.

The Quantum tech shows up, installs the new picker hand incorrectly, and continues to get the same error message as before on the library. Then, he tweaks something and manages to run over the umbilical cord that connects the hand/picker to the rest of the library's electronics at about 8:30pm. Since it shot sparks, Quantum decides to send another umbilical out to us. I've been in the office for quite a while at this point, so I send the service tech home with instructions to come back in the morning. Quantum delivers the part to my apartment at about 11pm, just as I'm finishing up watching the season finale of The Office on my DVR.

Saturday, May 17, 2008

9:00 AM: Arrive at Office.
10:00 AM: Replacement umbilical cord installed. Same error.
10:30 AM: Service determines the problem is in the (already replaced) picker.
10:35 AM: Closest picker is in Irvine, it is being sent by courier to arrive at 2pm.
11:00 AM: Lunch.
2:30 PM: Picker arrives late due to bad traffic in Oceanside/Del Mar.
3:30 PM: Same error. All field-serviceable parts have been serviced. Quantum replacing entire chassis.
3:45 PM: Closest chassis is in downtown LA. Estimated arrival: 8:00 PM.
3:46 PM: I send the service tech home. I don't trust him anymore. I'll swap the chassis myself.
7:01 PM: Courier must have broken all sorts of speed laws to get chassis to us by 7pm.
10:30 PM: Restore operations begin. I go home.

Sunday, May 18, 2008

9:00 AM: Service tech returns to pick up the bad unit. The new one is working fine, thanks.
