Loxodontectomy: Elephant-removal (and replacement)
The rapid pace of the big data community can quickly leave Hadoop environments out of date. Many great tools provide ways to upgrade your software without too much hassle. Unfortunately, earlier versions of the Hortonworks Data Platform (HDP) are a bit clunky to upgrade. A recent project of mine involved upgrading an older version of HDP (earlier than 2.1) to v2.4. Upgrading the whole stack would have been a very time-consuming process (more than two weeks), so we decided to just transplant the edge node into a brand new cluster.
Our team works in what I can only describe as a “tensely agile” environment: we work in sprints while the rest of the enterprise works in marathons. This tension causes difficulty whenever our work has external dependencies. Firewalls, account creation, and change advisory boards all work together to ensure things are done right, but slowly. Moving every process from the old cluster to a totally new one would have taken a precious few weeks, time the dev team did not have. Luckily, we had the flexibility of working in a cloud environment (OpenStack, woo!).
So I decided to keep our original edge node, spin up a new environment with (out of date now, of course) HDP 2.4.2, upgrade the clients on the edge node and make it a part of the new cluster.
Here’s how I did that.
Obviously, I let the dev team know that their environment was going to be down for a day. I hoped it wasn’t going to take that long, but you never know with Hadoop. I also let them know that all their data in HDFS was going away. It would be a new cluster after all. Their stuff on the edge node would be fine though. I also let them know there was a chance this wasn’t going to work and they might lose everything, since I couldn’t find anyone else doing this online.
I saved the IP addresses of the master nodes and data nodes. I saved the /etc/passwd file from all the nodes (including the edge node) to maintain a list of user UIDs. I spun up the new cluster, installed HDP 2.4.2 and made sure it was all working properly and took note of the /etc/passwd file on these new nodes too. There were differences that needed to be addressed.
I placed the edge node in maintenance mode, just in case, and stopped the Ambari Agent. I shut down the old cluster and deleted all the old nodes (there was no point in backing them up, since we couldn’t go back to the old cluster). Then, on the edge node, I ran the cluster cleanup scripts that Hortonworks provides.
This deletes yum repositories, symlinks, user groups, users, logs, anything Hadoop that might be on the system. At this point I had an edge node with no particular identity, just a server with an SFTP directory, developers’ user accounts, and outdated application code.
I then followed the same install guide I had used for the new cluster on this old edge node. I registered the node with the new Ambari server, installed the new clients, and updated various packages so everything behaved well (the edge node was on CentOS 6, while the new cluster ran on CentOS 7).
Then the tricky part: copying the original developers’ accounts across the cluster so that there weren’t any user collisions. The new YARN user across the data and master nodes had a UID of 1004 while Alex’s account was UID 1004 on the old edge node. This would be a serious problem, so I knew I had to fix it. It was a pigeonhole principle problem from college all over again. This is where the old and new /etc/passwd files came in handy.
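To make the collision check concrete, here is a minimal sketch of how the saved /etc/passwd files could be compared. It is not the exact procedure I followed (I tracked this by hand), and the sample file contents are made up for illustration; only the Alex/YARN clash on UID 1004 comes from the real cluster. The function names are hypothetical.

```python
def parse_passwd(text):
    """Map username -> UID from the contents of an /etc/passwd file."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 7:
            continue  # skip malformed entries
        users[fields[0]] = int(fields[2])
    return users


def uid_collisions(edge_users, cluster_users):
    """Return {uid: (edge_name, cluster_name)} for UIDs that both sides
    use under different usernames."""
    cluster_by_uid = {uid: name for name, uid in cluster_users.items()}
    collisions = {}
    for name, uid in edge_users.items():
        other = cluster_by_uid.get(uid)
        if other is not None and other != name:
            collisions[uid] = (name, other)
    return collisions


# Illustrative file contents (assumed, not copied from the real nodes).
old_edge = "root:x:0:0:root:/root:/bin/bash\nalex:x:1004:1004::/home/alex:/bin/bash\n"
new_node = "root:x:0:0:root:/root:/bin/bash\nyarn:x:1004:1004::/home/yarn:/bin/bash\n"

print(uid_collisions(parse_passwd(old_edge), parse_passwd(new_node)))
```

Accounts such as root that share both name and UID are not reported, since they do not need to move.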
I moved the developers’ UIDs to 4000+ (no particular reason) and changed the ownership of all the system’s files to these “new” users with the help of this guide: https://muffinresearch.co.uk/linux-changing-uids-and-gids-for-user/
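As a rough sketch of that remapping, the snippet below only generates the shell commands for review rather than running them. It assumes the usermod-then-find/chown approach the linked guide describes; the function name and the sequential numbering from 4000 are my own illustration, and group IDs would need the same treatment (not shown).

```python
def remap_plan(users, base_uid=4000):
    """Build a dry-run plan that moves each user to a UID at or above base_uid.

    users maps username -> current UID. Returns (new_uids, commands), where
    commands are shell command strings to review and run by hand as root.
    """
    new_uids = {}
    commands = []
    for offset, (name, old_uid) in enumerate(sorted(users.items())):
        new_uid = base_uid + offset
        new_uids[name] = new_uid
        # Change the account's UID.
        commands.append(f"usermod -u {new_uid} {name}")
        # Re-own files still carrying the old numeric UID. Caveat: if another
        # account already holds old_uid, this would grab its files as well.
        commands.append(f"find / -uid {old_uid} -exec chown -h {new_uid} {{}} +")
    return new_uids, commands


# Example with the one collision named above; any other entries are assumptions.
new_uids, commands = remap_plan({"alex": 1004})
for command in commands:
    print(command)
```

Printing the plan first makes it easy to paste the old and new UIDs into a tracking sheet before touching the system.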
All the swapping took a while, and keeping track of it all in Excel was very worthwhile.
Just like after a real surgery, you expose the site to the rigors of normal function to ensure proper recovery. I submitted MapReduce jobs, queried the YARN API, made config changes, and generally put the edge node through its paces to make sure it all worked.
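One of those checks could look roughly like the sketch below, which asks the YARN ResourceManager REST API for its cluster info and confirms the state is STARTED. The hostname is a placeholder and 8088 is the default ResourceManager web port, so adjust both for your cluster; the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_cluster_info(rm_url, timeout=10):
    """GET /ws/v1/cluster/info from the ResourceManager and decode the JSON."""
    url = f"{rm_url.rstrip('/')}/ws/v1/cluster/info"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def rm_is_started(payload):
    """True if the cluster info payload reports the ResourceManager as STARTED."""
    return payload.get("clusterInfo", {}).get("state") == "STARTED"


# Usage from the edge node (placeholder host, assumed default port):
#   info = fetch_cluster_info("http://master1.example.com:8088")
#   print(rm_is_started(info))
```

A reachable, started ResourceManager only proves the edge node can talk to the cluster; submitting a real job is still the better test of the client configs.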
Before turning it over to the developers, I remade the familiar HDFS structure our application depends on and moved over some of the source data they work on, so there’d be minimal downtime on their end. Done.
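Recreating the directory layout is easy to script. The sketch below builds `hdfs dfs -mkdir -p` commands and defaults to a dry run that only prints them. The paths are hypothetical stand-ins, not our application's real layout, and the helper names are mine.

```python
import subprocess


def mkdir_commands(paths):
    """Build one `hdfs dfs -mkdir -p` command (as an argv list) per path."""
    return [["hdfs", "dfs", "-mkdir", "-p", path] for path in paths]


def run_all(commands, dry_run=True):
    """Print each command; execute it too unless dry_run is set."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Hypothetical layout for illustration only.
layout = ["/apps/myapp/raw", "/apps/myapp/processed", "/user/alex"]
run_all(mkdir_commands(layout), dry_run=True)
```

Source data can then be loaded with `hdfs dfs -put` from the edge node once the directories exist.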
All in all, the process was fairly painless, despite some headaches with user accounts. If you find yourself in a similar situation, I highly recommend the Loxodontectomy since we saved so much time.
Have questions? Need help? Drop a line below.