Disaster Recovery Wrap-up

As I alluded to in earlier posts, $work (a large university) held our annual disaster recovery test last week. This involved recovering the following items at our “cold” site in Atlanta, GA: network, backup server, and various systems supporting payroll, registrar, alumni, and grant proposals. The goal is to have everything up and running within 72 hours, and we normally come very close, averaging between 72 and 80 hours.

Before I go any further, I should explain what a “cold” site is. There are three types of sites normally used for disaster recovery. A “cold” site has nothing on-site until we declare a disaster, except for the physical datacenter. A “warm” site has some hardware that is not in production use, but is running and remotely accessible. And a “hot” site contains production systems. Every company should be striving for a hot site, more commonly known as Business Continuity in ITIL speak. With a hot site, if a production system fails, users have no idea the system they were using is gone and continue working like nothing happened.

Six co-workers traveled to Atlanta on Sunday and recovered the backup server, network (including DNS), and our virtual infrastructure, albeit a bit smaller than our production environment. Two other system admins and I left Monday morning and found all the “support” systems up and running, waiting for us.

I was responsible for the payroll system and the alumni system. These two systems include 3 physical servers and 2 virtual servers. I started recovering the payroll system from a mksysb image, but that was taking a very long time. While it was running, I brought up the 2 virtual machines; actually, I just changed the IP address on snapshots of the production boxes, so it only took about 30 minutes total to recover both of those systems.
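
For the curious, re-addressing a cloned guest is about as minimal as recovery gets. Here is a rough sketch, assuming RHEL-style guests; the addresses and interface names are made up, not our real subnets:

```
# /etc/sysconfig/network-scripts/ifcfg-eth0 on the cloned guest
DEVICE=eth0
BOOTPROTO=static
IPADDR=10.20.30.40        # hypothetical DR-site address
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network
GATEWAY=10.20.30.1        # hypothetical DR-site gateway

# then bounce the network stack
service network restart
```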

I started on the other 2 physical servers, which required a clean install of RHEL 5. Since I had created kickstart files for each of these systems, that was finished fairly quickly as well, and I moved on to recovering files from the backup tapes. Four hours later the mksysb restore was still hung at 4% for some reason, so we decided that box needed a clean install too, which was ready on the morning of day two.
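
A kickstart file for boxes like these is not much more than partitioning and a package list. A trimmed-down sketch; the password hash, addresses, and sizes here are all placeholders, not our real configs:

```
# minimal RHEL 5 kickstart sketch -- every value below is hypothetical
install
text
lang en_US.UTF-8
keyboard us
rootpw --iscrypted $1$XXXX$XXXXXXXXXXXXXXXXXXXXXX
timezone America/New_York
network --bootproto=static --ip=10.20.30.41 --netmask=255.255.255.0 --gateway=10.20.30.1
bootloader --location=mbr
clearpart --all --initlabel
part /boot --fstype ext3 --size=200
part swap --size=4096
part / --fstype ext3 --size=1 --grow
%packages
@base
@core
```

Point the installer at it from the boot prompt (something like linux ks=http://kickstart-server/payroll-dr.cfg, with a made-up server name) and walk away.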

The next day, the alumni system and one other physical server were ready for the database administrators to recover the Oracle instances. The DBAs showed up, as planned, around lunch on day two. Not much else happened on day two other than watching progress bars, since that is what happens when you recover 2TB worth of database.
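
The DBA side of that is mostly RMAN grinding through the restored backup pieces, something along these lines; this is a generic sketch, not our actual run book, and the backup path is invented:

```
rman target /

RMAN> startup nomount;
RMAN> restore controlfile from '/backup/alumni_ctl.bkp';   # hypothetical backup piece
RMAN> alter database mount;
RMAN> restore database;
RMAN> recover database;
RMAN> alter database open resetlogs;
```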

The only thing that happened on day three was that most of the file-systems on one of the physical boxes kept filling up. We traced this back to the volume groups: on our DR system they were created with a 512MB PP (physical partition) size, while the production system used a 1024MB PP size. That meant the script I ran to create the logical volumes and file-systems built every file-system at half the size it should have been. And if you guessed AIX, you guessed right; that server was running AIX.
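
The gotcha is that the PP size is baked in when the volume group is created, and anything allocated by PP count scales with it. A quick sketch with hypothetical volume group, disk, and filesystem names:

```
# PP size is set at VG creation time -- specify it instead of taking the default
mkvg -s 1024 -y datavg hdisk2
lsvg datavg | grep "PP SIZE"

# an LV allocated by PP count silently shrinks if the PP size is smaller
mklv -t jfs2 -y apps_lv datavg 100   # 100 PPs = 100GB at a 1024MB PP size, only 50GB at 512MB
crfs -v jfs2 -d apps_lv -m /apps -A yes

# one way to recover without rebuilding the VG: grow the filesystems in place
chfs -a size=+50G /apps
```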

We finished up Thursday evening with a successful recovery of all systems, and went out to enjoy dinner. [insert creepy music] Dinner was interrupted by a phone call from home base: apparently the chiller had failed and the server room was running at 90+ degrees. We, the sysadmins and DBAs, left dinner to find network coverage and start shutting down systems back home.

After about 2 hours, the chiller was restarted, and once the temperature started dropping we brought the production systems back online without incident. So it was a disaster within a disaster, [insert your favorite Inception joke here].
