I moved a bunch of servers from one colocation facility to another last night. I had been preparing this move for a few weeks. While everything went well in the end, things took much longer than anticipated. Here are a few lessons learned.
1. preparation is *everything*
I spent a lot of time preparing for the move, on three levels: communication, hardware, and software.
Communicating the planned outage to the people affected beforehand is obviously very important. Not only does it warn them, it also helps identify ways in which customers are affected that perhaps were not anticipated. Who knew that customer XYZ's fetchmail script used a fixed mailserver IP instead of the proper hostname? Your customers will help identify potential problem areas before the migration, which will save everyone involved time and frustration.
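To illustrate the fetchmail case (host names, addresses and credentials below are made up): a `.fetchmailrc` that polls a hard-coded IP keeps working right up until that address changes, while one that polls the hostname simply follows DNS.

```
# ~/.fetchmailrc -- brittle: hard-coded mail server address
poll 192.0.2.25 proto imap user "jdoe" pass "secret"

# better: let DNS do its job
poll mail.example.net proto imap user "jdoe" pass "secret"
```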
On the hardware level, it helps to know what you are getting into. I prepared by making sure I was thoroughly familiar with the hardware I was going to deploy – especially the gear you rarely touch when everything works as expected: PDUs, the serial console server, etc.
As for software, this migration involved IP address renumbering, which is never fun. The physical machines all run a number of Xen instances that each run their own applications, which meant that IP addresses had to be modified all over the place. Taking a meticulous inventory of what needs to change where before the migration pays off big time when the big move comes. I even went as far as updating config files with the new information before shutting down the machines at the old location. This worked very well and saved a lot of time and effort during the migration: I just switched the machines on and dealt with the (far fewer) files that could not be modified beforehand. Reducing the DNS time to live to 300 seconds well before the migration was essential, obviously.
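To make the inventory step concrete, here is a rough sketch of the kind of pass I mean; the network prefixes and paths are placeholders rather than the real ones, and you would obviously review the file list before letting anything rewrite it:

```
#!/bin/sh
# Sketch only: find configs that still mention the old prefix, then
# pre-stage the new addresses. Prefixes and paths are placeholders.
OLD_NET='192\.0\.2\.'
NEW_NET='198.51.100.'

# 1. inventory: which files still reference the old prefix?
grep -rl "$OLD_NET" /etc /srv/xen 2>/dev/null | tee /root/renumber-list.txt

# 2. after reviewing the list, pre-stage the new addresses
#    (keeps a .premove backup of every file it touches)
while read -r f; do
    sed -i.premove "s/$OLD_NET/$NEW_NET/g" "$f"
done < /root/renumber-list.txt
```

The TTL part is just a matter of dropping the zone default (the `$TTL` directive in a BIND-style zone file) to 300 well in advance, so that cached records with the old, longer TTL have expired by the time you renumber.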
2. pay attention to the little things
Details, details, details. I lost time because rack posts that needed moving back were, as it turned out, screwed down with Torx screws. Luckily I had brought a lot of tools, among them a set of allen wrenches – but this prevented me from using my battery-operated drill/screwdriver. Speaking of which – quite a few screws were recessed too deeply to be reachable with the drill, again requiring manual tightening. Note to self: bring an extender bit for the drill next time, and see if I can get some hex/Torx bits.
I also lost time because I got stuck in traffic driving to the old colo. While this was not such a big deal since everything was still up, it did shorten the amount of time I had to work with during the maintenance window. Lesson learned: schedule migrations around rush hour, and/or allot enough time for bad traffic.
3. prepare for the unexpected
Something will happen that you did not anticipate. In this case, a machine with Intel e1000 NICs refused to properly autonegotiate with an HP ProCurve switch, leading to all sorts of speed and duplex mismatch problems. This appears to be a bug in the e1000 firmware (yay!), and no matter what I tried with ethtool, or by setting fixed parameters on either side of the link, the NIC would just not talk properly to the switch. Workaround: put a different switch in between the e1000 and the ProCurve. Lesson learned: bring spare parts, even for things that you think won't fail. Until yesterday I held Intel's NICs in very high esteem. Today – perhaps not so much. Bigger picture lesson: expect the unexpected, and try to prepare for all sorts of eventualities. This also means applying the freelance consulting rule to the timing of your maintenance window: estimate how much time you will need, then multiply by two or two and a half.
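For the record, this is roughly what I was trying on the Linux side (the interface name and speed are illustrative; the ProCurve end was likewise pinned to matching fixed settings through its own CLI):

```
# see what the NIC and the driver think the link looks like
ethtool eth0

# turn off autonegotiation and force speed/duplex on the NIC side
ethtool -s eth0 autoneg off speed 100 duplex full

# watch the kernel's view of the link while it comes back up
dmesg | grep -i eth0
```

No combination of settings made the e1000 and the ProCurve agree – hence the intermediate switch.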
4. set aside time for the fallout
Aside from the autonegotiation problem, things went pretty well last night. The migration took about 4 hours longer than expected, but for the bulk of that time most services had already been restored. Still – my announced maintenance window was too short. It turned out not to be too big a deal since I was working through a good part of the night, but there was some impact, since I had to deal with customers in three time zones spanning a total of 8 hours.
While everything took (much) longer than expected, there were only a couple of smaller issues to deal with today. Still, you want to budget some time after a migration to stabilize things. This is also a good time to document the new setup!