I have been administering Puppet installations for almost five years; FIVE years! I have always known that at some point node management would become an issue. Unfortunately, I have never worked with an infrastructure that had massive numbers of exactly identical servers. I started by configuring nodes to inherit group definitions stored in another manifest. Then I saw the environments feature, but I never implemented it, mostly because of the overhead and the numerous, well-documented bugs that will not be solved until Puppet 4.
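For context, the group-inheritance pattern I describe looks roughly like the sketch below; the host, group, and class names are placeholders for illustration, not my actual manifests.

    # groups.pp -- shared group definitions (placeholder names)
    node 'basenode' {
      include ntp
      include ssh
    }

    node 'webgroup' inherits 'basenode' {
      include apache
    }

    # site.pp -- individual hosts inherit their group definition
    node 'web01.example.com' inherits 'webgroup' { }
    node 'web02.example.com' inherits 'webgroup' { }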
Category Archives: linux
Work Work Work
Back at work, building our SAP Business Objects production infrastructure. I have automated as much as possible so it will be easier to manage down the road.
I wish SAP would actually produce valid documentation; nowhere did I find the actual database permissions the installer needs to complete the install. I ended up having to grant the DBA_ALL role, which scares me a little. If you cannot document the permissions your application needs, then you are doing something wrong.
I do have some light at the end of the tunnel: I have scheduled a vacation to Key Largo for some scuba diving in the next few weeks. It will be a much-needed break.
Disaster Recovery Wrap-up
As I alluded to in earlier posts, $work (a large university) held our annual disaster recovery test last week. This involved recovering the following items at our “cold” site in Atlanta, GA: the network, the backup server, and various systems supporting payroll, the registrar, alumni, and grant proposals. We are supposed to get everything up and running in 72 hours, and we normally come very close, averaging between 72 and 80 hours.
Before I go any further, I should explain what a “cold” site is. There are three types of sites normally used for disaster recovery. A “cold” site has nothing on-site until we declare a disaster, except for the physical datacenter. A “warm” site has some hardware that is not in production use, but is running and remotely accessible. And a “hot” site contains production systems. Every company should be striving for a hot site, more commonly known as Business Continuity in ITIL speak. With a hot site, if a production system fails, users have no idea the system they were using is gone and continue working like nothing happened.
Disaster Recovery
I have been pretty busy this week reviewing documentation and creating media for our disaster recovery test coming up next week.
Centralized Logging and Puppet
At $WORK I have been tasked with building a centralized logging infrastructure. After researching the available options, I came across the following blog: edgeofsanity.net. The author implements centralized logging with Kibana and Logstash.
So I am following along, but since we only have 200 servers I am only building two servers: one to host Kibana and one running Elasticsearch.
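As a rough sketch of that layout in Puppet terms (the host names and class names below are placeholders, not the actual modules from the linked post):

    # site.pp -- two-node logging layout (placeholder names)
    node 'kibana01.example.com' {
      include logstash   # central log receiver
      include kibana     # web front end for searching logs
    }

    node 'es01.example.com' {
      include elasticsearch   # index and storage back end
    }

Splitting the search front end from the index node keeps the indexing load on its own box, which should be plenty of capacity for a couple hundred servers.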
Triple Monitor Desktop
So I finally found enough time at $work to set up my triple-monitor desktop. I received a new desktop and monitor a few months ago; I was already running dual monitors, and the new system came with a monitor.
I am running Fedora 16 on my desktop with the Xfce desktop environment. The new system is a Dell Optiplex 980 with an ATI Radeon HD 3450. I added the ATI Radeon X600 (PCIe) card from the old system, a Dell Optiplex GX620. Both cards can drive dual monitors, so I currently have the capability to run four screens. Configuration is posted below the fold.
Upcoming Events
Big things are happening at $work; most of my time has been tied up architecting and helping design the FlexPods we recently purchased from NetApp. More info on that can be found here.
The other item I have been busy with is our data center shutdown. We are getting a new UPS and we have to power down the primary server room for 48 hours during the installation.
Disaster Recovery – Part 1
This week has been pretty hectic at $work. On Tuesday at 1545 (3:45pm), our director came into the bullpen and announced that we had lost power to the server room and the UPS would die in approximately 10 minutes.
The database administrators immediately tried to log into every database server to shut down the databases. Needless to say, they did not get to everything, and every server crashed hard. Luckily, the new FlexPod is being hosted offsite, since we have some cooling issues to fix. Most of us were at work until midnight, with some staying until 0400.
The server administrators were not affected, since we were moved out of the building containing the server room about three years ago. The root cause turned out to be construction work on another floor of the building.
One of the workers had caused an electrical short. Instead of tripping the breaker on the floor where the work was being done, the main building breaker tripped. The generator should have had plenty of time to fire up and take the load. However, it was discovered that the bypass had been left tripped the last time our contractors inspected the generator. And we went dark….