Copying small chunks of data in and around Hadoop is fairly trivial, but moving larger chunks can be time-consuming or needlessly complicated. Sometimes you even want to move data between Hadoop clusters (if you have two or more). In this article, I’ll show you a great way to handle all of these scenarios.
Copying Data Within a Cluster
You probably already know that you can copy data from one path in HDFS to another. That can be accomplished with this brief Hadoop file system command:
hadoop fs -cp /data/in/hdfs/ /new/path/in/hdfs/
But that’s a sequential copy. Did you know you can run that very same copy as a distributed, MapReduce-powered job? It’s called Distributed Copy, or DistCP for short, and it looks like this:
hadoop distcp hdfs://cluster1edge01:8020/data/in/hdfs/ hdfs://cluster1edge01:8020/new/path/in/hdfs/
This command launches a distributed copy of the data in the path /data/in/hdfs/ and writes it to the path /new/path/in/hdfs/ on the same cluster. You might also notice the use of cluster1edge01 in the above command. This is the hostname of the machine designated as your cluster’s namenode, and 8020 is the port it listens on. The namenode is the machine that keeps track of where every file in HDFS lives, so DistCP has to go through it to resolve both paths. That’s why each path in the command must point to it.
This is wonderful when the data in the first path is large enough that a sequential copy (the first command shown in this post) would take many minutes, or even time out. The first command is good for copying manageable files, while DistCP is better for big ol’ data. Best of all, DistCP is submitted to the job tracker and runs as a MapReduce job. Truly a parallel copy.
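That rule of thumb can be sketched as a small shell helper. Treat it as an illustration only: the function name choose_copy_cmd and the 1 GiB cutoff are assumptions made up for this example, and the function just prints the command it would pick instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the copy command suited to a given data size.
# The 1 GiB cutoff is an assumption for illustration; tune it for your cluster.
THRESHOLD_BYTES=1073741824

choose_copy_cmd() {
    size_bytes=$1   # total size of the source path, in bytes
    src=$2          # source path or URI
    dst=$3          # destination path or URI
    if [ "$size_bytes" -lt "$THRESHOLD_BYTES" ]; then
        # Small enough for a sequential, single-client copy.
        echo "hadoop fs -cp $src $dst"
    else
        # Large data: hand the copy to a distributed MapReduce job.
        echo "hadoop distcp $src $dst"
    fi
}
```

For example, choose_copy_cmd 5000000000 /data/in/hdfs/ /new/path/in/hdfs/ prints the distcp form of the command. The size itself could come from something like hadoop fs -du -s on the source path.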
But what if you want to copy from your first Hadoop cluster to another Hadoop cluster? Well, first off, congratulations: you or the company you work for has a sweet setup. Second, DistCP can handle that too. And you don’t have to change much either.
Copying Data from Cluster to Cluster
The command is pretty much the same, with one small difference…
hadoop distcp hdfs://cluster1edge01:8020/data/in/hdfs/ hdfs://cluster2edge01:8020/new/path/in/hdfs/
Did you catch it? The second path points to cluster2 instead of cluster1. This tells DistCP to copy the data in the specified path on cluster1 and write it to the specified path on cluster2. Yeah, it’s that easy.
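If you run copies like this often, the only parts that change are the two namenode hostnames and the two paths. Here is a minimal sketch of wrapping that up in shell. The helper names hdfs_uri and distcp_between are hypothetical, port 8020 is simply the one used in the commands above, and the functions only print the command so you can inspect it before running it.

```shell
#!/bin/sh
# Hypothetical helper: build an hdfs:// URI from a namenode host and a path.
# Port 8020 matches the commands in this article; yours may differ.
hdfs_uri() {
    uri_host=$1
    uri_path=$2
    uri_port=${3:-8020}
    echo "hdfs://${uri_host}:${uri_port}${uri_path}"
}

# Hypothetical helper: print (not run) a DistCP command that copies
# one path from a source cluster to a target cluster.
distcp_between() {
    src_host=$1   # namenode of the source cluster
    dst_host=$2   # namenode of the target cluster
    src_path=$3   # path to copy from
    dst_path=$4   # path to copy to
    echo "hadoop distcp $(hdfs_uri "$src_host" "$src_path") $(hdfs_uri "$dst_host" "$dst_path")"
}
```

Calling distcp_between cluster1edge01 cluster2edge01 /data/in/hdfs/ /new/path/in/hdfs/ prints the same cluster-to-cluster command shown above.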
Rules and Things of Note
DistCP isn’t all fun and games. There are rules and things to note, namely:
- You must specify the namenode of each cluster involved in the command.
- DistCP will not overwrite files on copy… unless you tell it to with the -overwrite parameter.
- DistCP has an -update parameter, but make sure you understand how it works before you use it.
- DistCP will not create the source’s parent directories for you on the target. There are workarounds, but it’s best to make sure those directories exist before you copy to get the desired results.
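To make the last three rules concrete, here is a hedged sketch that creates the target’s parent directory first and then runs DistCP with either -update or -overwrite. The helper names run and safe_distcp are hypothetical, and everything is printed instead of executed unless you set DRY_RUN=0. Both flags also change how directory contents land on the target, so check the DistCP documentation for your Hadoop version before relying on them.

```shell
#!/bin/sh
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: make sure the target's parent directory exists,
# then copy. mode is "update", "overwrite", or anything else for the
# default behavior (existing files on the target are left alone).
safe_distcp() {
    mode=$1
    src=$2
    dst=$3
    dst_parent=$4
    # Create the parent path on the target before copying.
    run hadoop fs -mkdir -p "$dst_parent"
    case "$mode" in
        update|overwrite) run hadoop distcp "-$mode" "$src" "$dst" ;;
        *)                run hadoop distcp "$src" "$dst" ;;
    esac
}
```

With the default dry run, safe_distcp update hdfs://cluster1edge01:8020/data/in/hdfs/ hdfs://cluster2edge01:8020/new/path/in/hdfs/ hdfs://cluster2edge01:8020/new/path/in/ prints the mkdir command followed by the distcp -update command, so you can review both before running them for real.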