So Apache Airflow is getting pretty popular now (understatement), and I figured I’d take some time to explain what it is, how to install it, and shed some light on how it all works. It’s awesome, trust me. Continue reading
I’ve managed Hadoop clusters for a little while now, and I’ve discovered that user management in Ambari is a little rough around the edges. Specifically, there’s no easy way to manage Ambari LDAP users from within Ambari itself, despite LDAP being a very popular way to provision and manage user access.
There is the command ambari-server sync-ldap [--users user.csv | --groups groups.csv] for adding users or groups, but that can be an issue if access to the ambari user or the Ambari server itself is limited. Additionally, the command-line utility has no innate control over HDFS directories (either creating or deleting them) upon a user or group sync, which adds extra steps to the user creation process.
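As a sketch, syncing a specific set of users might look like the following. The usernames and file name are placeholders, and the sync-ldap call itself has to be run as a privileged user on the Ambari host, so it is shown in comments:

```shell
# Hypothetical list of LDAP usernames to sync (one per line).
printf 'alice\nbob\ncarol\n' > users.csv

# On the Ambari host, as a privileged user (prompts for admin credentials):
#   ambari-server sync-ldap --users users.csv
# Or sync LDAP groups instead:
#   ambari-server sync-ldap --groups groups.csv

# Note: neither call creates HDFS home directories. You'd still need
# something like:
#   hdfs dfs -mkdir /user/alice && hdfs dfs -chown alice /user/alice
```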
To address this, I present:
From the command line, it’s easy to see the current state of any running applications in your YARN cluster by issuing the yarn top command. Continue reading
As you progress in your big data journey with Hadoop, you may find that your datanodes’ hard drives are gradually getting more and more full. A tempting fix is to simply plug more hard drives into your servers: you’ve got extra slots on your racks, and adding entirely new nodes is an expensive (and a little tedious) task. This is particularly relevant when hard drives start failing on your datanodes.
Unless you want to spend a long time fixing your cluster’s data distribution, I urge you,
Don’t Just Plug That Disk In.
If you’ve been running Spark applications for a few months, you might start to notice some odd behavior from the history server (default port 18080). Specifically, it’ll take forever to load the page, show links to applications that no longer exist, or even crash. Three parameters take care of this once and for all. Continue reading
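The excerpt doesn’t name the parameters, but the history server’s cleaner settings are the usual suspects for this symptom; a sketch for spark-defaults.conf, with illustrative values:

```properties
# Let the history server clean up old event logs on its own.
spark.history.fs.cleaner.enabled   true
# How often the cleaner runs (example value).
spark.history.fs.cleaner.interval  1d
# Event logs older than this are deleted (example value).
spark.history.fs.cleaner.maxAge    7d
```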
You just kicked off a command on the command line and one of three things happens:
- You have to leave your computer and run off to a meeting or go talk to your boss,
- You realize you’ve underestimated how much data or work that command has to deal with, and it’s probably going to take a few hours, or
- You were testing a command, liked that it was working, and now want to just let it run.
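In all three cases, the shell’s job control is the usual escape hatch; a minimal sketch, with sleep standing in for the real command:

```shell
# If the command is already running in the foreground:
#   Ctrl+Z     -> suspend it
#   bg         -> resume it in the background
#   disown -h  -> detach it so it survives logout

# If you haven't started it yet, launch it detached from the start:
nohup sleep 30 > job.log 2>&1 &    # sleep 30 stands in for your command
pid=$!
disown "$pid" 2>/dev/null || true  # bash builtin; no-op in shells without it
kill "$pid"                        # clean up the demo job
```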
Loxodontectomy: Elephant-removal (and replacement)
The rapid pace of the big data community can quickly leave Hadoop environments obsolete and out-of-date. Many great tools provide ways to simply upgrade your software without too much hassle. Unfortunately, earlier versions of the Hortonworks Data Platform (HDP) are a bit clunky to upgrade. A recent project of mine involved upgrading an older (pre-2.1) version of HDP to v2.4. Upgrading the whole stack would have been a very time-consuming process (more than two weeks), so we decided to just transplant the edge node into a brand-new cluster. Continue reading
Apache NiFi is changing the way people move huge amounts of data. It’s never been easier (or cheaper) to move and transform raw data into meaningful insight.
On that note, the release notes for Apache NiFi 1.0.0-BETA were published a couple of days ago! These are very exciting times as the project matures.
Here are the biggest updates:
- A brand new UI (!!!)
- Cleaner lines, better toolbars, more modern look. Things will be easier to find.
- Zero master clustering
- This means that NiFi has become much more resilient to failure while staying lightweight. Individual nodes “elect” a leader to handle coordination across the cluster. This is similar to how YARN High Availability works.
- Actual multi-user tenancy
- There’s a drop-down for admins to create users and set permissions, visibility, etc. for actual users. Before, there were really two types of users: Admins, who ruled the whole workflow, and Managers, who could make changes to workflows. Now, individual components can be controlled per user.
- Version control for templates
- This is really huge. The upload/implement-template function of NiFi was fairly hidden and obscure before. With version control, I’d hope that a user will be able to check various branches of the template XML files in and out easily.
Beyond that, there are a few more built-in processors now and a ton of bug fixes.
Please note, I’m not a contributor/committer of Apache NiFi, but I do follow and use the tool fairly regularly. It’s exciting to see the community making these changes. If you see something in the release notes that excites you, let us know below!
If you’ve spent any time with a Hortonworks Data Platform cluster, you’re familiar with Ambari. It’s one of the finest open-source cluster management tools, allowing you to easily launch a cluster, add or remove nodes, change configurations, and add services. Using Ambari takes a lot of the guesswork out of managing a Hadoop cluster, and I absolutely love it.
The one downside of Ambari is that it can be tedious to add functionality to the core client. For that reason, the smart people building the tool at Apache added something called an Ambari View: a way to extend Ambari’s functionality without going down the rabbit hole of modifying its source code. Views are essentially plug-and-play tools that only require an Ambari restart to work.
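To sketch what “plug-and-play” means here: a View ships as a jar whose heart is a small descriptor. Something like the following view.xml, where the names are placeholders of my own choosing:

```xml
<view>
  <name>HELLO-WORLD</name>
  <label>Hello World View</label>
  <version>1.0.0</version>
  <!-- A default instance Ambari creates when the view is deployed. -->
  <instance>
    <name>INSTANCE_1</name>
  </instance>
</view>
```

Drop the jar into Ambari’s views directory, restart ambari-server, and the view shows up in the Views menu.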
In the following blog post, I’ll discuss getting your View off the ground and show you several tips about actually using them.
Next Post: Apache Ambari: Hello World!
Hello World with Ambari Views
Previously, I gave a brief overview of what an Ambari View is and how it can be beneficial to you. Let’s dive in! Continue reading