Monday, June 23, 2008

SAN Nightmare, Part 7

Note: This is part 7 of an 8-part series. Read them in order; they'll make more sense that way. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

SAN Nightmare, Part 7: Eliminating the SAN

Tuesday, June 17

A plan to migrate off of Xsan and back to conventionally formatted disks is born in the weekly IRT meeting. We decide to do the entire thing at once, over the weekend... and to split our files across two physical servers instead of one. The end result should be two file servers, each holding roughly half of the data and half of the users... with a total fill ratio of about 36% across all disks. Since we couldn't get an exact 50-50 split, we opted to give slightly more users to the computer that was faster and had more disk space.

Saturday, June 21

The migration began at 8:00 PM on Friday night. I also received my very own corporate credit card on Friday. Throughout the day on Saturday, files were copied to the new disks. I spent as much of that time as I could at home, sitting out on my patio reading a book and getting tan. By some fluke, the temperature in La Jolla climbed to 97°F around 3:00 PM. I got a little bit pink, but just the right amount, and it has already turned into a nice tan.

Just before bed, my phone starts chirping with SMS alerts from one of our drive arrays. It appears that drive "Utica-6" has developed read/write errors and is no longer reliable. Thankfully, this is a RAID5 array... the data is protected and still intact. I go to bed and decide to deal with it in the morning.

Sunday, June 22

Overnight, Utica-6 produced many, many more errors. It was time to replace it. I had exactly one spare drive on hand, and it went in as the replacement. Time to put the new corporate card to use: we needed more spare drives on hand. Of course, the particular variant of the Hitachi Deskstar 7K500 that I needed for a matched set is no longer manufactured... which made finding an identical replacement a bit of a problem. After a couple of hours scouring Google and the internet at large, I found ONE of these drives, still new and unopened... in a warehouse in Canada. I also found a site that sells refurbished drives of this exact model. I ended up buying the single new drive along with two refurbs to replenish our stock of spares, and while I was at it I bought some new tape for our P-Touch labeler, since I had run out over the weekend.

Monday, June 23

By Monday evening, we've gone one whole day without a single server or disk crash. This should not be big news, but given the recent string of events, I'd say it's pretty impressive. Things seem stable and, for the most part, speedy. The Utica array is still rebuilding and conditioning itself after the failure of drive 6, so files on that particular array are a tiny bit slower than normal. Other than that, everything is fine.

Our former SAN was made up of one file server, one metadata controller, and two dedicated backup target computers. We have since unwound that setup and turned three of those machines into independent file servers. The fourth computer is four years old and has a bad FireWire controller, so I think its useful days are over. Later this week, I plan to strip it for parts and use them to upgrade our admin file server.
