Writings on various topics (mostly technical) from Oliver Hookins and Angela Collins. We have lived in Berlin since 2009, have two kids, and have far too little time to really justify having a blog.
Fortunately I don't have to do much data analysis, but when I do there are two options: either the data is local to our own datacenter and I can use our own Hadoop cluster, or it is external and I can use Elastic MapReduce. Generally you don't run an Elastic MapReduce cluster all the time, so when you create your cluster you still need to get that data into the system somehow. Usually the easiest way is to use one of your existing running instances outside of the MapReduce system to transfer it from wherever it may be to S3. If you are lucky, the data is already in S3.
Even better, Elastic MapReduce has the ability to run jobs against datasets located in S3 (rather than on HDFS as is usually the case). I believe this used to be a customisation AWS had applied to Hadoop, but it has been in mainline for some time now. It is really quite simple - instead of supplying an absolute or relative path to your HDFS datastore, you can provide an S3-style URI to the data such as:
The "magic" here is not that it now runs the job against S3 directly, but that it creates a job before your main workflow to copy the data over from S3 to HDFS. Unfortunately, it's a bit slow. Previously it has also had showstopper bugs which prevented it from working for me at all, but in a lot of cases I just didn't care enough and used it anyway. Today's job had significantly more data, and so I decided to copy the data over by hand. I knew it was faster, but I did not expect as much of a difference as this:
The first part of the graph is the built-in copy operation as part of the job I had started, and where it steepens significantly is where I stopped the original job and started the S3DistCp command. Its usage is relatively simple:
hadoop fs -mkdir hdfs:///data/
hadoop jar lib/emr-s3distcp-1.0.jar --src s3n://my-bucket-name/path/to/logs/ --dest hdfs:///data/
The S3DistCp jar file is already present on the master node once it has been bootstrapped, so you can run this interactively or as a step on a cluster that you launch automatically. I thoroughly recommend using it, as it will cut down the total time of your job significantly!
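If you want to run the copy as a scripted step rather than typing it by hand, the two commands above can be wrapped in a small shell function. This is only a minimal sketch: the jar path and URIs are the ones used in this post, while the function names `run` and `copy_s3_to_hdfs` and the `DRY_RUN` and `S3DISTCP_JAR` variables are hypothetical helpers I made up for illustration, not part of Hadoop or EMR.

```shell
#!/bin/sh
# Sketch: copy a dataset from S3 into HDFS with S3DistCp before running a job.
# Assumption: the jar lives at the path used in the post; override it with
# S3DISTCP_JAR if your cluster keeps it somewhere else.
S3DISTCP_JAR="${S3DISTCP_JAR:-lib/emr-s3distcp-1.0.jar}"

# Run a command, or only print it when DRY_RUN=1 (hypothetical switch,
# handy for checking the command line without a cluster).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# copy_s3_to_hdfs SRC DEST
# Creates the destination directory, then copies SRC into it.
# The copy is skipped if the directory cannot be created.
copy_s3_to_hdfs() {
    src="$1"
    dest="$2"
    run hadoop fs -mkdir "$dest" &&
        run hadoop jar "$S3DISTCP_JAR" --src "$src" --dest "$dest"
}
```

Calling `copy_s3_to_hdfs s3n://my-bucket-name/path/to/logs/ hdfs:///data/` on the master node runs the same two commands as above; with `DRY_RUN=1` exported it only prints them, which is a cheap way to check the paths before starting a long copy.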