Infrastructure Manoeuvers


System administration is traditionally neglected by developers, and exceptions to this trend are not easy to find. As a new member of the peerTransfer dev team, I have found that the whole team has a deep respect for, and understanding of, system administration.

This dedication is reflected in one of our regular meetings: Infrastructure Manoeuvers.

Manoeuvers is a meeting with several objectives, including refreshing the developers' knowledge of how our machines are organised and optimising key systems.

The meeting consists of a set of practical problems that @josacar kindly prepares beforehand. We, the developers, have to solve them working in pairs. The problems are mock system failures, triggered manually and drawn from past incidents or anticipated risks.

Since causing trouble in production is not an option for us (no, we are not Netflix), these exercises are performed in the staging environment, a copy of the production environment, which is where real failures and night calls happen (luckily not often).

OH MY GOD THEY KILLED UNICORN YOU BASTARDS!

We start with an alarm going off. We have no idea of the problem or its cause ahead of time; we just watch the messages come in from the monitoring system. The first step is to consult our internal wiki to reach a diagnosis, and then we try to solve the problem as fast as possible, just as if it were happening on any night in the production environment.

In the last manoeuvers, we came across these problems:

  • nginx process down
  • dead DHCP client
  • lost connectivity between machines
  • Amazon pushes a machine out of a load balancer
  • unicorn dies, leaving zombie processes

Reading the diagnoses, the problems seem easy to fix, but in practice the only information we receive is a couple of emails and automatic chat messages (plus SMS in production) reporting that one machine is not answering requests on a specific port.
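To make that kind of alert concrete, here is a minimal sketch of the check behind a "machine is not answering on a port" message. It is not our actual monitoring setup, only an illustration; the function name `port_is_answering` and its default timeout are assumptions made up for this example.

```python
import socket


def port_is_answering(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration only: it checks that something accepts
    TCP connections on the port, not that the service behind it is healthy.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For instance, `port_is_answering("staging-web-1", 80)` (an invented host name) would return False both when nginx is down and when the machine has lost connectivity, which is exactly why the alert alone does not tell us the cause.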

To wrap up the meeting, we hold a debate to decide on future improvements to the system and to compare the different approaches we took while solving the problems.

In summary, it was a meeting where the development team had a lot of fun, and @josacar had even more watching us suffer.

Happy rebooting!