Monday, June 23, 2008

SAN Nightmare, Part 7

Note: This is part 7 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

SAN Nightmare, Part 7: Eliminating the SAN

Tuesday, June 17

A plan for migration off of xSan and back to regularly formatted disks is born in the weekly IRT meeting. We decide to do the entire thing at once, over the weekend... and to split up our files onto two physical servers instead of one. The end result should be two file servers, each with approximately half of the data and half of the total users... and a total fill ratio of about 36% across all disks. Since we couldn't get an exact split of 50-50, we opted to give a slightly higher number of users to the computer that was faster and had more disk space.

Saturday, June 21

The migration began at 8:00 PM on Friday night. I also received my very own corporate credit card on Friday. During the course of the day Saturday, files were copied to the new disks. I spent as much time as I could at home, sitting out on my patio reading a book and getting tan. By some stroke of random, the temperature in La Jolla got up to 97 around 3pm. I got a little bit pink, but just the proper amount. It's already turned into a nice tan.

Just before bed, my phone starts chirping with SMS messages from one of our drive arrays. It appears as though Drive "Utica-6" has developed read/write errors and is no longer reliable. Thankfully, this is a RAID5 system... the data is protected and still intact. I go to bed and deal with it in the morning.

Sunday, June 22

Overnight, Utica-6 produced many, many more errors. It was time to replace it. I had exactly one spare drive on hand, which was used for this purpose. Time to put the new corporate card to use -- We needed to get more spare drives to have on-hand. Of course, the particular variant of the Hitachi DeskStar 7K500 drive that I needed to have a matched set is no longer manufactured... which posed a bit of a problem for finding an identical replacement drive. After a couple hours scouring Google and the internet at large, I was able to find ONE of these drives still in its new, unopened condition... in a warehouse in Canada. I also found a site that sells refurbished drives of this exact model. I ended up buying the single new drive along with two refurbs to replenish the stock of spare drives, and I bought some new tape for our P-Touch labeler while I was at it because I ran out over the weekend.

Monday, June 23

By Monday evening, we've gone one whole day without a single server or disk crash. This should not be big news, but given the recent string of events I'd say it's pretty impressive. Things seem stable, and for the most part speedy. The Utica array is still rebuilding and conditioning itself after the failure of drive 6, so files on that particular disk are a tiny bit slower than normal. Other than that, everything is fine. Our former SAN was made up of one file server, one metadata controller, and two dedicated backup target computers. We have since unwound that and turned it into three independent file servers. The fourth computer is four years old and has a bad FireWire controller, so I think its days of usefulness are over. Later this week, I plan to strip it for parts and upgrade our admin file server.

SAN Nightmare, Part 6

Note: This is part 6 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 6: More Suffering

Monday, June 9

My boss announces that we are no longer allowed to use his corporate credit card to make purchases, because his assistant has transferred to another department and will no longer be reconciling the bill for him. Later that afternoon, someone tried to copy 200+ GB to their lab volume, which is hosted on one of the external drives. The drive fills to 100% capacity, and suddenly we are in need of an additional FreeAgent Pro. This minor crisis is the justification for having Accounting assign me my own corporate credit card.

Thursday, June 12

The 'san4' volume starts crashing repeatedly mid-day, much in the same way 'san5' did on May 27. This particular volume was the exclusive host of the file server data for the company we share the building with, and is therefore a critical part of our network. It becomes necessary for us to take this volume offline and move the data on it to a new drive as soon as possible. With no more FreeAgent Pro drives available, I had to use a lesser FireWire drive. The lack of eSATA slowed things down considerably. Meanwhile, we placed a rush order for another FreeAgent Pro drive, another eSATA card, and two internal hard drives for the server that we're turning into a dedicated server for this group.

Friday, June 13

The data copy was complete by about 1:00 PM. We had decided to take this chance to move this particular set of files to a dedicated server to avoid future problems like this, and because it makes things much cleaner for us from an administration standpoint. I began the migration to the new server at 6PM.

At 4:00 PM, the same instability issues previously experienced on the san5 and san4 volumes spread to san6. We are forced to take two more labs offline to move them to the new FreeAgent Pro drive purchased the day before.

Both copy operations are successful and wrap up faster than anticipated, leaving some spare time over the weekend to come up with a more comprehensive plan for final removal of xSan from our network.

SAN Nightmare, Part 5

Note: This is part 5 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 5: Relative Stability

Thursday, May 29

A pre-planned trip to Sacramento couldn't have come at a better time. I thought I was going to be unable to go, but as it turns out, things are pretty stable for the two days before I leave. For the most part, they remain that way while I am gone. I think there was one instance where the server had to be rebooted.

Over the course of the weekend, Apple continues to work on finding a way to recover files from the original "san1" volume that crashed. They're still convinced there may be hope of recovering those files... which would go a long way toward making people not hate me so much. My understanding is that there was a fair amount of priceless research that was lost from that one week gap in the backups.

Thursday, June 5

Much of the early portion of the week is consumed by maintenance requests: finding specific files and folders that didn't restore properly, fixing file permission issues, etc. Just before noon, Apple declares our san1 volume to be officially unrecoverable. At this point, we are free to re-format the disks and start moving data back to them. We opt to use the space to do a little additional testing, first.

SAN Nightmare, Part 4

Note: This is part 4 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 4: Recovery

Wednesday, May 21

Restore operations have been ongoing since Saturday night, and the last batch of lab restores completed by the end of the day. A full week of downtime for some labs... and their files are being restored from the night of May 8. Ouch. The IRT and Public volumes are still offline. Restore operations at this point have taken a back seat to making sure the backup operations run properly.

Friday, May 23

After working with Apple over the course of the week, it was determined that a software patch was necessary to prevent recurring issues. In addition to the catastrophic failure of our "san1" volume, we had been experiencing other problems throughout the week with our other SAN volumes. Specifically, they would crash on a very regular basis, usually at least once a day, and each crash required a server reboot to recover. We were provided with a patch that contained a known fix for the known bug we were experiencing, which I installed in the evening while the server was offline for scheduled maintenance.

Almost immediately after installing this patch, I confirmed that the previous problem we had (random fsm crashes) seemed to be fixed... but that a new, much more serious problem had been introduced: random segmentation faults that corrupted the entire operating system. Oops. After running a few more diagnostics, we reverted the software back to the original less-buggy version and went from there.

Tuesday, May 27

The long holiday weekend gave me a chance to get the last volumes restored from tapes and back online. At this point, I finally get access to my files again -- I had been without them for 12 days, and work was piling up. By noon, another one of our SAN volumes (san5) had become very unstable -- three crashes in under an hour. Each time it crashed, it hung the file service processes as well -- file servers get angry when you disconnect, without warning, the disks that their hosted files live on. We had to take the four labs living on this volume offline for the remainder of the day. By 9:00 PM, all data on the san5 volume had been moved to another Seagate FreeAgent Pro disk.

SAN Nightmare, Part 3

Note: This is part 3 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 3: Disaster Strikes

Thursday, May 15, 2008

At 10:00 AM, the "san1" volume, our largest at 5TB and host to the data of 11 of our 17 research divisions (plus IT), crashes and refuses to re-mount on either server. After a series of reboots and continued failures attempting to get the thing to mount, another call to Apple was in order. By noon, Apple determines that we should probably start restoring the data from backup tapes, since the prognosis does not look good. I start running some diagnostic tools for Apple while Michael loads the most current tapes in the tape library to begin the restore operation. At this time, we discover two things: (1) The volume "san1" is the only one that has not properly backed up in the last week (since the upgrade to xSan 2.0), due to the backup process crashing while running, and (2) The robotic arm on our tape loader has picked this exact moment, of all possible times, to fail.

So, we have a drive that won't mount because it's corrupt, backups of that drive that are one week stale, and no way to read the stale tapes anyway because the robotics on the tape loader have failed. The first order of business, then, is getting the tape library working so we can read the tapes.
We put in a call to Quantum support about getting the library repaired and were told that since the device is out of warranty, they won't even talk to us until we purchase a support contract. The support contract department is out of the office for the day, so we should expect a call back tomorrow.

Feeling completely helpless, I decide to go home by 4:00 PM to get some dinner and some rest, because I have to be back in the office at 8:45 PM to start taking equipment offline for a large planned power outage. The impeccable timing of this disaster plus the planned outage kept me at the office until 2:45am.

Friday, May 16, 2008

By 8:00 AM, I'm back at the office running on a little less than 4 hours' sleep. Efforts to contact the folks at Quantum are unsuccessful all morning, so I leave to run other errands while Michael continues to try to talk to the contract folks. I head to Fry's to buy two 1TB Seagate FreeAgent Pro drives so we have somewhere to put the data once we start restoring it. (Side note: The FreeAgent Pro drives are eSATA/USB2/1394 and are awesome. I highly recommend them.) On the way back, a trip to Costco was in order to pick up beer and desserts for the IRT-hosted Happy Hour that was scheduled for 4:00 PM. We had already booked and paid for the catering, so we couldn't cancel the thing. Another case of impeccable timing.

Shortly after returning to the office, we finally manage to get in touch with Quantum regarding the support contract. We ended up paying a bit over $3000 for the "Gold" maintenance contract which entitles us to 24/7 on-site support. They diagnose the problem as a bad picker hand and schedule a courier to deliver the part by 4pm, and a technician to install the part by 6pm. Convenient: The IRT Happy Hour ran from 4-6pm.

The Quantum tech shows up, installs the new picker hand incorrectly, and continues to get the same error message as before from the library. Then, he tweaks something and manages to run over the umbilical cord that connects the hand/picker to the rest of the library's electronics at about 8:30pm. Since it shot sparks, Quantum decided to send another umbilical out to us. I've been in the office for quite a while at this point, so I send the service tech home with instructions to come back in the morning. Quantum delivers the part to my apartment at about 11pm, just as I'm finishing up watching the season finale of The Office on my DVR.

Saturday, May 17, 2008

9:00 AM: Arrive at Office.
10:00 AM: Replacement umbilical cord installed. Same error.
10:30 AM: Service tech determines the problem is in the (already replaced) picker.
10:35 AM: Closest picker is in Irvine; it is being sent by courier to arrive at 2pm.
11:00 AM: Lunch.
2:30 PM: Picker arrives late due to bad traffic in Oceanside/Del Mar.
3:30 PM: Same error. All field-serviceable parts have been serviced. Quantum replacing entire chassis.
3:45 PM: Closest chassis is in downtown LA. Estimated arrival: 8:00 PM.
3:46 PM: I send the service tech home. I don't trust him anymore. I'll swap the chassis myself.
7:01 PM: Courier must have broken all sorts of speed laws to get chassis to us by 7pm.
10:30 PM: Restore operations begin. I go home.

Sunday, May 18, 2008
9:00 AM: Service tech returns to pick up the bad unit. The new one is working fine, thanks.

SAN Nightmare, Part 2

Note: This is part 2 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 2: Imminent Failure

Monday, May 12, 2008

In the morning, our LDAP server at work manages to get its internal account database corrupted. This server issue has absolutely no bearing on the xSan project other than its timing -- I ended up spending most of the day on Monday running around trying to fix other systems that were affected by the LDAP outage instead of paying attention to the backup scripts I'd started on Saturday afternoon.

Tuesday, May 13, 2008

Our weekly IRT meeting focused mainly on the LDAP failure from Monday, and how to better communicate things like downtime in the future. A brief wrap-up of the SAN migration over the weekend was presented, with the verdict that things looked good so far. After the meeting, I checked on the backups and noticed the first problem: instead of shortening the backup window, xSan seemed to be lengthening it dramatically. We had six SAN volumes, each of which was supposed to back up nightly. We had one dedicated computer to run backups and the ability to add a second if necessary, which would give us two simultaneous backup processes at most. The "san5" volume alone was taking about 27 hours to run an incremental backup. As a result, our other volumes were being skipped over for backups because the process was taking so long. I made a few changes and set up the "san1" volume to start a backup operation.
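The arithmetic here is brutal. As a rough sketch of why nightly backups could never complete, assume a 10-hour overnight window (my assumption; the post doesn't give exact hours) and made-up incremental times for the other five volumes; only san5's 27 hours is a real figure from above:

```python
# Back-of-the-envelope check of the nightly backup window.
# Assumed values are marked; only san5's 27h comes from the post.
NIGHTLY_WINDOW_HOURS = 10         # assumed overnight window
MAX_PARALLEL_STREAMS = 2          # one backup computer, option to add a second

# Hypothetical per-volume incremental times, in hours.
incrementals = {"san1": 3, "san2": 2, "san3": 2, "san4": 3, "san5": 27, "san6": 2}

total_stream_hours = sum(incrementals.values())
# Even with perfect parallelism, completion time can't beat the longest
# single job, nor the total work divided across the streams.
best_case = max(max(incrementals.values()),
                total_stream_hours / MAX_PARALLEL_STREAMS)
print(f"best case: {best_case:.1f}h vs a {NIGHTLY_WINDOW_HOURS}h window")
```

With san5 alone at 27 hours, the best case blows through the window no matter how the other volumes are scheduled, which is exactly why the other volumes kept getting skipped.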

Later in the afternoon, we start having odd problems with some of our share points on the server. It turns out that people in specific labs aren't able to connect to their files, because the volume their data resides on has un-mounted itself from the file server. The odd thing was that I couldn't get it to re-mount on that computer -- but it would mount on the "spare" server just fine. Over the course of the afternoon, I re-configured all the server share points onto the new hardware and moved the DNS records over. This allowed everyone to connect to the new box using the same server names. Everything seemed happy.
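The trick that made the cutover invisible to users was repointing DNS rather than renaming servers. A hypothetical zone-file fragment (hostnames and addresses invented for illustration) shows the idea:

```
; Hypothetical records -- names and addresses are placeholders.
; Before: clients connect to the service name on the old server.
fileserver  IN  A  10.0.1.10   ; original file server
; After: same name, repointed at the spare box; clients reconnect unchanged.
fileserver  IN  A  10.0.1.20   ; "spare" server now hosting the share points
```

Because clients mount shares by name, not address, nobody had to touch a single workstation after the swap.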

SAN Nightmare, Part 1

Note: This is part 1 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 1: The Upgrade

Friday, May 9, 2008

On the evening of Friday, May 9, I scheduled downtime on the file server to upgrade our version of xSan from 1.4.2 to 2.0. I had hoped the new version would fix some random, nagging problems we'd been having with the software such as occasional unannounced server reboots and problems with certain types of files. The random reboot thing was happening more or less on a weekly basis and seemed to coincide with some larger backup operations we were doing. The upgrade was also (hopefully) going to help our backup server more effectively back up the data on the SAN by improving read/copy speeds.

I downloaded the migration guides and read them over prior to starting the upgrade. The guide mentioned the need to wait a period of time (sometimes a few hours) for the volumes to update their metadata to the new 2.0 format before they would be available. I installed the software and noticed the volumes were showing up in the GUI admin tool as being available after a few minutes, and didn't think much of it until I tried to start/mount some of them and began to receive errors. After freezing up the GUI several times by trying to start a volume, each time having to force-reboot the server, I decided to call Apple.

The support rep on the phone mentioned something that was not noted anywhere in the migration guide at all: you have to just let the upgrade run its course before trying to start the volumes. Problem: there is no progress bar that tells you (a) whether the update has started, (b) whether it is running, or (c) when it is done. The update runs silently in the background and can take "hours" depending on what exactly you're storing there. To determine whether the RPL update is done, you have to go hunting through the system logs for a very specific (undocumented) file and search for the string that denotes entries related to the upgrade. Thanks for documenting that, Apple.

Saturday, May 10, 2008

After the support call with Apple from the previous night, I decided to just let the "RPL update" run its course overnight, and come back in the morning to see how things looked. With all the RPL updates done and all our volumes mounting properly, things seemed to be in good shape. I brought the server back online at about 2pm, re-configured the backup routines, and told them to start backing up the server.

SAN Nightmare, Part 0

Note: This is part 0 of an 8 part series. Read them in order, it'll make more sense. Part 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Part 0: Introduction

This series of entries chronicles select events that happened between May 9 and June 23, 2008. The purpose of this series is primarily so I can remember how bad the last two months have really been. In the process, maybe someone will stumble across this and decide that xSan has the potential to have as much of a "distaste for your environment" (Apple's words) as it did for mine.

A bit of background: We had been running the xSan 1.4.2 software since late November, 2007. Due to some issues we had with it (explained later on), we decided to upgrade to the new version in hopes of a fix. When we originally implemented the xSan solution, we did so because we were intrigued by the idea of allowing multiple servers to share a single large pool of disk space. This would, in theory, allow us to do things like share a single public folder across several servers or move groups of people from one server to another for load-balancing reasons without having to move their data. Furthermore, it allowed us to set up a model in which one computer on the SAN was dedicated specifically to handling all of Retrospect's backup requests. Structuring backups in this way freed up a considerable amount of CPU time on the file server itself to do things such as serve files in a timely fashion.

Sunday, June 22, 2008

New Apartment, Part 3

As promised in the previous post, pictures have been posted of the apartment with all the furniture where it now lives. Additionally, photos and art have been installed on (most of) the walls. Things are looking better, but still need some work.

Also included are a few quick snapshots of the garden I planted last weekend. Enjoy!

Link to Photo Gallery

Saturday, June 21, 2008

New Apartment - Furniture

After a very long delay, I have finally gotten around to uploading pictures of my new apartment with actual, real furniture in it. Between moving, unpacking, decorating, a crazy two months at work, and trying to actually be social, I haven't had time to bother getting these things uploaded until now.

Most of this gallery consists of photos taken during the "unpacking and settling in" phase, so a lot of the furniture is no longer arranged as shown in the photos. Another set of pictures will follow with the "final" furniture layout.

Click Here to link to the photo gallery.

Thursday, April 24, 2008

Anza-Borrego Pictures Uploaded

See the entry below for pictures of my trip to Anza-Borrego back on Easter to take pictures of wildflowers.

Friday, April 4, 2008

Padres Game

For being such a great customer, Cox gave me a stack of tickets to the Padres' Friday Night game of Opening Week against the Dodgers. In the Cox suite. These are some pretty nice digs for a baseball game. If you've never been to Petco Park or sat in the Garden Level suites... they are right behind home plate and at a perfect height to see the entire field (and catch fly balls.)

View from our seats.

After the unfortunate (blowout) game, there was a special fireworks show for what the Padres were calling "Military Opening Night." Some of the photos that I took of that ended up coming out alright, as well.

Postgame fireworks display

In the end, a good time was had by all. Hopefully, I'll get tickets to another game soon!

Tuesday, April 1, 2008

New Apartment

I went over to the new building at Crossroads on opening day, a week before I was supposed to get my keys and move into my new place. The maintenance staff hadn't completed inspecting and finishing up my apartment... and accidentally left it unlocked. I took the opportunity to go in with the camera and tape measure and take lots of pictures (and lots of measurements.)

Photos of the (empty) new apartment are here.

Photos of the place with furniture will come later.

Sunday, March 23, 2008

Anza-Borrego

I took a nice Easter drive out to Anza-Borrego today to do some wildflower viewing. Also did some off-paved-roading in my 2-wheel-drive low-clearance car and almost got stuck in some sand. Saw a Mustang that did get stuck in some sand. Despite the fact there were several vehicles that stopped to help, one of which had towing ropes... the ladies who were driving the Mustang refused help of any sort and insisted on waiting for a tow truck to come from an hour away to help them.

Update 24-Apr-2008: Pictures are online! Click on the pic below for the link to the gallery.

Tuesday, March 18, 2008

Small Victories

Time Warner updated the operating software on all of their cable boxes in my area this week. It happened to the standard-def box I have in my bedroom first, and I was quite annoyed when it happened because it managed to delete (seemingly randomly) about 2/3 of the scheduled series recordings I had set up.

This morning, I noticed my HD box in the living room had been updated. Same annoying problem. But, a new behavior out of the cable box: on non-HD channels, instead of putting up stupid, ugly gray vertical bars on either side of the 4:3 picture to make it fit in the 16:9 window, it now outputs solid black bars. I might actually be able to watch non-HD programming on that expensive TV without wanting to throw things at it. Like I said, it's the small victories...

Friday, February 1, 2008

I'm Moving!

On April 7, I get the keys to my new apartment. This is important for several reasons. In no particular order:

- I will be freed from the mold-ridden, flood-prone craphole that is my current apartment.
- I will no longer be living on the first floor, and hence less prone to water issues.
- I will be moving into a 1-bedroom apartment, and therefore once again living alone.
- The new apartment is in a brand new building. I will be the first person to ever live in my unit.
- The new apartment is big at 862 square feet, and cheap for comparable units in the area.
- I will still be close to work. Only about 2.5 miles each way.

Here's an official floor plan, borrowed directly from their website:

Official Floor Plan

Get a good look at that? Good, because that's not at all what my unit will look like. I'm going to have a mirror-image flipped version of what you see above. I took the liberty of flipping the image horizontally so you can see what my unit will really look like. Note that the text in the image isn't so kind as to magically stay put.

My Floor Plan

Now that this is settled, I can start furniture shopping. I've only been putting that off for, oh... 3 years. Should be painful and expensive.

Hoover Dam

On my last day in Vegas, I decided it would be a better use of my time to head out to Hoover Dam and take a tour, as opposed to spending another crowded, sweaty day inside the show halls at CES. Plus, it would be very un-like me to rent a car and put less than 50 miles on it... I average a lot closer to 2,000 miles per rental, after all.

The drive is about 20-30 minutes or so, even when going the speed limit in my bright orange cop-magnet of a rental car. I managed to walk right up to the Visitor's Center at 10:50 on a Wednesday morning and immediately get onto the 11:00 Dam Tour. Normally, people have to wait all day to get on one of those things. Ah, the advantages of traveling alone...

The gallery contains about 100 pictures from inside and outside of the Dam. Enjoy!


Nevada Generation Room

1930's-style waterproof light switch

The Hoover Dam Complex.

My Cop-Magnet rental car

CES 2008

Inside the gallery are a few pictures from inside CES 2008. Unfortunately, none of the pictures I was able to snap off adequately convey just how insanely packed the Convention Center was -- I am fairly certain I have never seen so many people in one place in my entire life. And with over 3.2 million square feet of space inside the convention center, stuff was still spilling out into the parking lot and into other hotels! Pretty insane.

R2/D2 Home Entertainment Center

People, People Everywhere...

Las Vegas at Night

This gallery contains all the pictures I took when I was in Vegas for CES... but not actually on the show floor at CES because it was nighttime. The first night I was in town, I took a stroll on foot starting at my hotel and working my way through the Hooters Hotel and Casino, followed by the Tropicana and MGM Grand. From there, I took the Monorail to Bally's and walked through that and Paris. I then went across the street to the Planet Hollywood Hotel and Casino and the attached Miracle Mile shops. After all that walking, I was beat and headed back to the hotel early to get some rest for the long day ahead of me on the show floors at CES on Tuesday.

Paris Las Vegas.

Christmas Tree and Skylight

Pictures in this gallery:

- The Dog.
- The new enormous fake Christmas tree.
- The skylight that blew off the roof.

All pictures taken on either 1/1/08 or 1/5/08.

Enormous Fake Tree.

Christmas Lights

On Christmas Eve, Mom, Nicole, and I went for a drive to go looking at Christmas lights. There's a court nearby in Citrus Heights that goes pretty overboard with them, so of course we had to go there. This gallery contains the pictures that came out OK.

The Grinch!

Iron Mountain Hike

On December 22, 2007 I went on a hike with John, Kristine, and John's parents at Iron Mountain. Other people were supposed to come too, but they chickened out... or something. I forget why exactly they didn't make it.

View from the top.

2007 LIAI Holiday Party

Photos from the LIAI 2007 Holiday Party, which was held at the Prado in Balboa Park. In November.

Me, getting ready to eat dinner.

Flooding and Wildfires

Some photos from the aftermath of the most recent La Regencia "Flood" and the soot dropped from the San Diego wildfires. Photos taken between 10/19/07 and 11/10/07.

I <3 La Regencia Maintenance.

Nasty post-fire air filters.

Mission Trails Hike

Pictures from a hike I took on September 30, 2007 at Mission Trails. Nothing really exciting here, but it's time to start flushing out the backlog of photos.

Amusing sign.