Monday, June 23, 2008

SAN Nightmare, Part 8

Note: This is part 8 of an 8-part series. Read them in order; it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

SAN Nightmare: Conclusions

In summary, xSan 2.0 sucks. Here's my general list of complaints:

- Undocumented steps in the upgrade process from 1.4.2 to 2.0 cause confusion and panic when users can't get their volumes to mount properly.

- Upgrade process introduced errors on one of our volumes that led to its eventual failure.

- Under 2.0, the fsm process crashes randomly and far too often when folders on SAN volumes are re-shared over AFP/SMB and/or backed up with Retrospect.

- Under 2.1, the fsm process segfaults in a manner similar to the crashes in 2.0. This can be easily reproduced by setting ACLs on an AFP/SMB shared volume and propagating permissions to all folders/subfolders under the top level of the share. Every time I try this, it crashes within 3 minutes.

- Under all versions, you cannot copy .mpkg files to an xSan volume over AFP. The volume crashes.

- Some programs do not allow you to open files directly on the server and edit them. Notable examples are EndNote and several Adobe apps. Instead, you have to copy the files to a local disk, edit them, and then copy them back to the server. This is annoying for users who keep their files on the server for safekeeping.

- Once an xSan volume crashes or becomes unstable, a computer reboot is often required to clear the memory and start fresh. If the volumes are unmounted uncleanly, the OS will still think files are open and try to close them before restarting. Since the volume is no longer mounted, it is unable to do this. This causes a hang on restart that prevents the system from being rebooted or shut down gracefully, so a forced reboot is required. Forcing a power cycle through the rack PDU works quite well, but is not good for the server.

- fsm crashes typically force reboots of the metadata controller and/or the client hosting them. When the client is the file server, this causes issues for connected clients. When the metadata controller is affected, all other volumes are forced to fail over while the controller reboots.

- Retrospect takes an incredibly long time to scan volumes for files and to determine whether files have been changed or not. Similarly, the actual backups of files themselves are slow. This seems to be the case no matter how fast your metadata controller is, but is significantly more pronounced when using older/slower computers as the metadata controller.

- Retrospect is unable to define sub-volumes of an xSan volume as backup targets because of the way the filesystem handles directory ID information. This forces Retrospect to scan the entire volume for a backup. On a 1.6 TB volume with 1 TB of used space and 500,000 files, the scan can routinely take up to 20 hours. On a normal HFS+ volume, this process takes mere minutes. This problem is compounded by the 4,000,000 file "limit" for Retrospect backup sets. Files are often marked as changed when they haven't been, and are backed up again. That problem, combined with normal change/modify operations, means that a backup set can easily approach the 4,000,000 file limit in the course of its normal incremental backups before tape rotation.

- File copy and general file operations that require access to filesystem metadata are noticeably slower under xSan 2.0 compared to 1.4.2.

- The xSan Admin GUI for 2.0 is completely different from 1.4.2, and takes some re-learning to get used to. In version 2.0 of the GUI, it is also impossible to change a computer's role in the SAN from a Controller to a Client or vice versa. Whatever role a computer has when it is added to the SAN is the role it must keep. I hear they fixed this in 2.1, but still... this is a very common thing that people do and it somehow got overlooked.

- If you open the xSan Admin GUI on more than one computer, you occasionally get differing/conflicting information. This is most notable in the actual name of the SAN (inconsequential) but also shows up in places where it should never report false information -- like where it tells you which metadata controller is currently controlling a specific volume. The cvadmin command line utility is so much better for most tasks, it's not even funny.
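
To put rough numbers on the Retrospect backup set problem from the list above, here's a back-of-the-envelope sketch in Python. The 500,000 file count and the 4,000,000 file limit come from the list; the per-run counts of files backed up again are hypothetical, picked only to illustrate the arithmetic, and the function name is made up. It assumes the set starts with one full backup and that every incremental adds the same number of files, which real backups won't do exactly.

```python
def incrementals_before_limit(initial_files, files_per_incremental, limit=4_000_000):
    """Count how many incremental backups fit in a backup set after the
    initial full backup, before the set would exceed `limit` files.

    files_per_incremental is a hypothetical, constant number of files
    (genuinely changed plus wrongly flagged as changed) added per run.
    """
    if files_per_incremental <= 0:
        raise ValueError("files_per_incremental must be positive")
    remaining = limit - initial_files  # room left after the full backup
    return max(remaining, 0) // files_per_incremental


# 500,000 files on the volume, as in the list above.
# Hypothetical: 125,000 files picked up on every incremental run.
print(incrementals_before_limit(500_000, 125_000))

# Hypothetical: only 25,000 files picked up on every incremental run.
print(incrementals_before_limit(500_000, 25_000))
```

With nightly incrementals, the first case hits the limit in about four weeks, which is why files being falsely flagged as changed matters so much here.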
