Monthly Archives: October 2011

Disaster Recovery – Part 2

The power came back on around 1615, and once the UPS batteries were charged and the network guys gave the all clear, we were allowed to start bringing systems up.

First to come up was our Fibre Channel SAN infrastructure. This consists of two director-series switches and five pairs of controllers. In parallel we brought up the failover DNS/DHCP server. Side note: we have ninety percent of our campus covered with 802.11n and use very short lease times so as not to exhaust our lease space. So the first thing the DHCP server did when it came up was spend about twenty minutes expiring the 14,000 previously active wireless leases.
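
As a rough illustration only, short wireless leases in an ISC dhcpd style configuration might look something like this (the subnet and timer values below are made up, not our actual settings):

    # Illustrative wireless scope with short leases so addresses turn over quickly
    subnet 10.20.0.0 netmask 255.255.0.0 {
      range 10.20.1.1 10.20.255.254;
      default-lease-time 900;   # 15 minutes
      max-lease-time 1800;      # 30 minutes
    }

The trade-off is that a large pool of short leases all expires quickly after an outage, which is exactly the churn the server had to chew through on startup.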

Once the core infrastructure was up and running, it was time to bring up the virtualized infrastructure: 18 hosts running around 500 virtual guests. So first I had to find the management server. Of course, it would not be a proper disaster if the power outage were the only problem; the host configured to run all of the various servers that manage the VMware environment was toast and would not boot. Dell ended up replacing every component in that server to get it back up and running.

I spent the next thirty minutes tracking down which hosts the other management servers lived on and powering them up. I ended up adding the management server to one of the functioning hosts to get it back up. Once the management servers were running, we started bringing up the other production systems. Of course, it was now close to 2100 hours.
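
For the curious, adding a guest to a surviving host basically amounts to registering its .vmx file and powering it on. A minimal sketch from the ESXi shell, with a made-up datastore path and VM name, looks something like this:

    # Register the management VM from its datastore (path and name are hypothetical)
    vim-cmd solo/registervm /vmfs/volumes/datastore1/mgmt01/mgmt01.vmx

    # Look up the new VM ID, then power it on
    vim-cmd vmsvc/getallvms
    vim-cmd vmsvc/power.on <vmid>

Once that VM is running, the rest of the environment can be managed from the usual tools again instead of host by host.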


Filed under disaster, server, vmware

Disaster Recovery – Part 1

This week has been pretty hectic at $work. On Tuesday at 1545 (3:45 pm) our director came into the bullpen and announced that we had lost power to the server room and the UPS would die in approximately ten minutes.

The database administrators immediately tried to log into every database server to shut down the databases. Needless to say, they did not get to everything, and every server crashed hard. Luckily, the new FlexPod is being hosted offsite while we fix some cooling issues. Most of us were at work until midnight, with some staying until 0400.

The server administrators' offices were not affected, since we were moved out of the building containing the server room about three years ago. We found the root cause to be some construction work on another floor of that building.

One of the workers had caused an electrical short. Instead of tripping the breaker for the floor where the work was being done, the short tripped the main building breaker. The generator had plenty of time to fire up and take the load. However, it was discovered that the bypass had been left tripped the last time our contractors inspected the generator. And we went dark….


Filed under disaster, linux, server