As you progress in your big data journey with Hadoop, you may find that your DataNodes' hard drives are gradually filling up. A tempting fix is to simply plug more hard drives into your servers: you've got spare drive bays, and adding entirely new nodes is an expensive (and a little tedious) task. This is particularly tempting when hard drives start failing on your DataNodes.
Unless you want to spend a long time fixing your cluster’s data distribution, I urge you,
Don’t Just Plug That Disk In.
Here’s what’ll happen:
You’ve got a data node with three disks that you use as your HDFS storage:
```
DataNode01:
FS        Size  Used  Avail  Use%  Mounted on
/dev/vdb  100G   70G    30G   70%  /data/disk01
/dev/vdc  100G   70G    30G   70%  /data/disk02
/dev/vdd  100G   70G    30G   70%  /data/disk03
```
You say to yourself: “Hey, I don’t like that my HDFS storage is getting kind of high and I’ve got three spare hard drives just laying around. I’ll increase storage by plugging them in!”
So you plug them in, create a file system, and mount the drives to new directories:
```
DataNode01:
FS        Size  Used  Avail  Use%  Mounted on
/dev/vdb  100G   70G    30G   70%  /data/disk01
/dev/vdc  100G   70G    30G   70%  /data/disk02
/dev/vdd  100G   70G    30G   70%  /data/disk03
/dev/vde  100G    0G   100G    0%  /data/disk04
/dev/vdf  100G    0G   100G    0%  /data/disk05
/dev/vdg  100G    0G   100G    0%  /data/disk06
```
And just like that you’ve doubled your cluster’s capacity!
You’re also about to have a bad time.
Three Weeks Later
You get a phone call in the middle of the night saying that every job running on the cluster is failing with “NoSpaceLeftOnDevice” errors.
“That’s not possible!”, you cry, exasperated. “We just doubled our storage three weeks ago!”
You log in to DataNode01 and look at the disks:
```
DataNode01:
FS        Size  Used  Avail  Use%  Mounted on
/dev/vdb  100G   99G     1G   99%  /data/disk01
/dev/vdc  100G   99G     1G   99%  /data/disk02
/dev/vdd  100G   99G     1G   99%  /data/disk03
/dev/vde  100G   29G    71G   29%  /data/disk04
/dev/vdf  100G   29G    71G   29%  /data/disk05
/dev/vdg  100G   29G    71G   29%  /data/disk06
```
Look at the offending disks: they're the original three. And despite your best efforts, they're basically full.
HDFS by default uses a round-robin policy to choose which disk (volume) each block lands on. So if you've got 7 blocks to write on this node, they'll actually be written like this:
- Block1 –> Disk01
- Block2 –> Disk02
- Block3 –> Disk03
- Block4 –> Disk04
- Block5 –> Disk05
- Block6 –> Disk06
- Block7 –> Disk01
The important thing to see here is that Blocks 1 and 7 are BOTH written to Disk01. Adding those new disks certainly increased your total HDFS storage, but your old disks are still nearing 100%. Every time round-robin lands a block on one of those full disks, the write fails with NoSpaceLeftOnDevice, and the job fails with it.
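The round-robin behavior above is easy to sketch. This is a toy illustration, not HDFS code: it just shows that the policy cycles through volumes in order, paying no attention to how full each one is.

```shell
#!/usr/bin/env bash
# Toy sketch of round-robin volume choosing: block N goes to
# disk (N-1) mod <number of disks>, regardless of free space.
disks=(disk01 disk02 disk03 disk04 disk05 disk06)

for block in $(seq 1 7); do
  disk=${disks[$(( (block - 1) % ${#disks[@]} ))]}
  echo "Block$block -> $disk"
done
```

Running it prints `Block1 -> disk01` through `Block6 -> disk06`, and then `Block7 -> disk01` again: the nearly full original disk gets just as many new blocks as the empty ones.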
The Fullest Disk is the Weakest Link.
This phenomenon and its fix are discussed in detail on this JIRA ticket: https://issues.apache.org/jira/browse/HDFS-1804
TL;DR: switch HDFS from round-robin to available-space-based volume choosing by setting the config "dfs.datanode.fsdataset.volume.choosing.policy" to the value "org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy" in your hdfs-site.xml file.
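In hdfs-site.xml, that change looks like the snippet below. The two extra threshold properties are optional tuning knobs for the available-space policy; the values shown reflect the commonly documented defaults, not recommendations for your cluster.

```xml
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<!-- Optional: volumes whose free space is within this many bytes of each
     other are treated as balanced and round-robined (default ~10 GB). -->
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<!-- Optional: fraction of new block allocations steered toward the
     volumes with more free space (default 0.75). -->
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>
```

Note this only changes how *new* blocks are placed; it won't drain your already-full disks.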
So how do you fix it?
You've got a few options, none of them particularly pleasant:
- Create a new node (really only available in a cloud environment), decommission the old node, and then recommission the old node once everything’s balanced out.
- Decommission the data node, format all the disks, and recommission it, letting HDFS gradually re-replicate the now under-replicated blocks onto the freshly formatted disks.
- If you like building things from scratch, there are a few disk balancers out there (and newer Hadoop 3 releases ship an intra-DataNode `hdfs diskbalancer` tool).
- MANUALLY (read: dangerous) move the data blocks from disk to disk with the command below. I take no responsibility for lost or corrupt data. In fact, I recommend against this option, because you need to know exactly what you're doing not to mess it up. Don't forget to stop your DataNode process first!
```
rsync --remove-source-files -av \
  /data/disk01/hadoop/hdfs/data/current/<BlockPool String>/current/finalized/<subdir0-N>/* \
  /data/disk06/hadoop/hdfs/data/current/<BlockPool String>/current/finalized/<subdir0-N>/
```
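If you do go this route, rehearse it first: rsync's `--dry-run` flag reports what would transfer without changing anything. Here's a sketch with throwaway temp directories (a fake block file, not real HDFS paths) showing that a dry run moves nothing even with `--remove-source-files` set:

```shell
#!/usr/bin/env bash
# Sketch: prove --dry-run is safe using disposable directories.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "fake block data" > "$src/blk_1073741825"

# Reports what WOULD move, but touches nothing.
rsync --dry-run --remove-source-files -av "$src/" "$dst/"

ls "$src"   # the fake block is still in the source: nothing was moved
rm -rf "$src" "$dst"
```

Only once the dry-run output looks right (and the DataNode is stopped) should you run the real command.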
Avoid the situation
You can, of course, avoid this situation entirely by making a few key changes to how you add space to your cluster. Adding fresh, new nodes to the cluster instead of plugging your unused hard drives into existing nodes will circumvent the round-robin issue. You will have to rebalance HDFS once the nodes have been stood up.
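The rebalance itself is handled by the stock HDFS balancer. A typical invocation looks like this; the threshold value is just an illustration, so tune it for your cluster:

```shell
# Run from any node with HDFS client configs. -threshold 10 means each
# DataNode should end up within 10 percentage points of the cluster's
# average utilization; lower values balance tighter but run longer.
hdfs balancer -threshold 10
```

Note that the balancer moves blocks *between* nodes; it won't even out disks within a single node.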
This is the recommended practice as cluster hardware will get old, fail, or fall out of warranty. By maintaining a healthy cycle of servers flowing in and out of your cluster, you can protect yourself against sudden failures or full disks.
Got other ideas? Let me know below!