Monday, June 23, 2008

SAN Nightmare, Part 1

Note: This is part 1 of an 8-part series. Read them in order; it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 1: The Upgrade

Friday, May 9, 2008

On the evening of Friday, May 9, I scheduled downtime on the file server to upgrade our version of Xsan from 1.4.2 to 2.0. I had hoped the new version would fix some random, nagging problems we'd been having with the software, such as occasional unannounced server reboots and problems with certain types of files. The random reboots were happening more or less weekly and seemed to coincide with some larger backup operations we were doing. The upgrade was also (hopefully) going to help our backup server back up the data on the SAN more effectively by improving read/copy speeds.

I downloaded the migration guides and read them over before starting the upgrade. The guide mentioned the need to wait a period of time (sometimes a few hours) for the volumes to update their metadata to the new 2.0 format before they would be available. I installed the software and noticed that the volumes were showing up as available in the GUI admin tool after a few minutes. I didn't think much of it until I tried to start and mount some of them and began to receive errors. After freezing the GUI several times by trying to start a volume, and having to force-reboot the server each time, I decided to call Apple. The support rep on the phone mentioned something that was not noted anywhere in the migration guide: you have to let the upgrade run its course before trying to start the volumes. Problem: there is no progress bar that tells you (a) whether the update has started, (b) whether it is running, or (c) when it is done. The update is all done silently in the background, and can take "hours" depending on what exactly you're storing there. To determine whether this "RPL update" is done, you have to go hunting through the system logs for a very specific (undocumented) file and search for the string that denotes entries related to the upgrade. Thanks for documenting that, Apple.
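For anyone stuck doing the same log hunt, the search itself is simple once you know the file and the string. Here is a rough Python sketch of the idea. It is not an Apple tool, and the function name, log path, and marker string are all placeholders of mine; the real file name and string are the ones Apple support gives you, and I'm deliberately not guessing at them here.

```python
from pathlib import Path


def find_marker_lines(log_path, marker):
    """Return (line_number, text) pairs for every log line containing `marker`.

    `log_path` and `marker` are supplied by the caller; this sketch makes no
    assumption about which file or string the upgrade actually uses.
    """
    matches = []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with Path(log_path).open("r", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if marker in line:
                matches.append((number, line.rstrip("\n")))
    return matches
```

You would call it as something like `find_marker_lines("/path/to/the/log", "marker string")`, substituting the real values, and then read the last few matches to judge whether the update has finished. An empty list just means no line in that file contains the string.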

Saturday, May 10, 2008

After the previous night's support call with Apple, I decided to just let the "RPL update" run its course overnight and come back in the morning to see how things looked. With all the RPL updates done and all our volumes mounting properly, things seemed to be in good shape. I brought the server back online at about 2 p.m., reconfigured the backup routines, and told them to start backing up the server.
