Monday, June 23, 2008

SAN Nightmare, Part 2

Note: This is part 2 of an 8-part series. Read them in order; it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 2: Imminent Failure

Monday, May 12, 2008

In the morning, our LDAP server at work managed to get its internal account database corrupted. This server issue had absolutely no bearing on the xSan project other than its timing: I ended up spending most of Monday running around fixing other systems affected by the LDAP outage instead of paying attention to the backup scripts I'd started on Saturday afternoon.

Tuesday, May 13, 2008

Our weekly IRT meeting focused mainly on the LDAP failure from Monday, and on how to better communicate things like downtime in the future. A brief wrap-up of the weekend's SAN migration was presented, with the verdict that things looked good to this point. After the meeting, I checked on the backups and noticed the first problem: instead of shortening the backup window, xSan seemed to be lengthening it dramatically. We had six SAN volumes, each of which was supposed to be backed up nightly. We had one dedicated computer to run backups and the ability to add a second if necessary, which would give us two simultaneous backup processes at most. The "san5" volume was single-handedly taking about 27 hours to run an incremental backup... on its own. As a result, our other volumes were being skipped over for backups because the process was taking so long. I made a few changes and set up the "san1" volume to start a backup operation.

Later in the afternoon, we started having odd problems with some of our share points on the server. It turned out that people in specific labs weren't able to connect to their files, because the volume their data resided on had unmounted itself from the file server. The odd thing was that I couldn't get it to re-mount on that computer, but it would mount on the "spare" server just fine. Over the course of the afternoon, I re-configured all the server share points onto the new hardware and moved the DNS records over. This allowed everyone to connect to the new box using the same server names. Everything seemed happy.
