
(something something) Big Data!

by Oliver on Saturday, April 12th, 2014.

I recently wrote about how I’d historically been using Pig for some daily and some ad-hoc data analysis, and how I’d found Hive to be a much friendlier tool for my purposes. As I mentioned then, I’m not a data analyst by any stretch of the imagination, but I have occasional need to use these kinds of tools to get my job done. The title of this post (while originally a placeholder for something more accurate) represents the feeling I have for these topics – only a vague idea of what is going on, but I know it has to do with Big Data (proper noun).

Since writing that post (and attempting, then failing, to find a simple way of introducing Hive usage at work – it’s yet another tool and set of data representations to maintain and support), I’ve also been doing a bit of reading on comparable tools, and frankly Hive only scratches the surface. With its mostly SQL-compliant interface, Hive has a lot of competition in this space (and this blog post from Cloudera sums up the issue very well). SQL as an interface to big data operations is desirable for the same reasons I found it useful, but it also brings latency expectations that traditional MapReduce-style jobs cannot meet, since those jobs tend to have completion times in the tens of minutes to hours rather than seconds.

Cloudera’s Impala and a few other competitors in this problem space attempt to address this by combining the large-scale data processing that is traditionally MapReduce’s strong point with very low latencies when generating results – just a few seconds is not unusual. I haven’t investigated any of these in depth, but I feel that as a sometimes-user of Hadoop via Pig and Hive it is just as important for me to keep abreast of these technologies as it is for the “power users”, so that when we do have occasion to need such data analysis, it can be done with as low a barrier to entry as possible and with maximum performance.

Spark

http://spark.apache.org/

Spark is now an Apache project but originated in the AMPLab at UC Berkeley. My impression is that it is fairly similar to Apache Hadoop – its own parallel-computing cluster, with which you interact via native language APIs (in this case Java, Scala or Python). I’m sure it offers superior performance to Hadoop’s batch processing model, but unless you are already integrating heavily with Hadoop libraries from these languages, it doesn’t offer a drastically different method of interaction.

On the other hand, there are already components built on top of the Spark framework which do offer a different method of interaction, for example Shark (also from Berkeley). In this case Shark even offers HiveQL compatibility, so if you are already using Hive there is a clear upgrade path. I haven’t tried it, but it sounds promising, although being outside of the Cloudera distribution and not having first-class support on Amazon EMR makes it slightly harder to get at (although guides are available).

Impala

http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html

As already suggested, Impala was the first alternative I discovered; it is also incorporated in Cloudera’s CDH distribution and available on Amazon EMR, which makes it more tempting to me for use both inside and outside of EMR. It supports ANSI SQL-92 rather than HiveQL, but coming from Pig or other non-SQL tools this may not matter to you.
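To give a flavour of that (a sketch only – the table and column names here are hypothetical, not a real schema), a typical aggregation is just plain SQL that Impala executes directly:

-- Hypothetical example: top 10 status codes by request count.
SELECT sc_status, COUNT(*) AS requests
FROM raw_logs
GROUP BY sc_status
ORDER BY requests DESC
LIMIT 10;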

PrestoDB

http://prestodb.io/

Developed by Facebook, PrestoDB can either query HDFS data directly without any additional metadata, or use the Hive metadata store via a plugin. For that reason I see it as somewhat closer to Impala, although, like Shark/Spark, it lacks the wider support in MapReduce deployments such as CDH and Amazon EMR.

AWS Redshift

http://aws.amazon.com/redshift/

Redshift is not really an open source tool like the others above, but it deserves a mention as it fits in the same category. If you just want to get something up and running immediately, this is probably the easiest option.

Summary

I haven’t even begun to scratch the surface of the tooling available in this part of the Big Data space, and the tools above are only the easiest to find amongst many further open source and commercial varieties. Personally, I am looking forward to the next occasion I have to analyse some data, when I can really pit some of these solutions against each other and find the framework that is most efficient and easiest to use for my ad-hoc data analysis needs.


Tools that make your life harder

by Oliver on Thursday, March 20th, 2014.

This post title is inspired by this Google Plus post, although I’d been meaning to write this post for a few days anyway (it’s just a catchier title for the same idea). I’m not a data scientist, analyst or even a hard-core Hadoop user by any stretch of the imagination, but on occasion I need to do some log analysis when there is simply too much data to force through Awk (as much as I hate to admit it).

For perhaps the last year I’ve been using Pig. I say “using”, but really it is learning, trying, failing, learning again, scratching your head and maybe eventually using. Since my usage is fairly sporadic I forget everything after a month or so of not using it and then start again the next time I have to use it. I’ve been away from Java for so long it’s better to just say I never knew it in the first place, so whenever my job fails and I get a 50-line stack trace it is usually fairly difficult for me to piece together why it failed, and let’s not even talk about trying to write UDFs. It’s a tool that I needed, but it undoubtedly made my tech life harder in some respects.

But, on the whole it has made me more productive with Hadoop. I don’t have to write any Java, and when it does work it works fairly well. However, the Hadoop ecosystem is by now fairly rich (even if you consider Apache projects exclusively) and there are alternatives to Pig at similarly high levels of abstraction above the basic Map/Reduce system. I’ve been meaning to look into Hive for a while, especially as I read up recently on Redshift and concluded that the SQL-based approach is gaining popularity on multiple fronts. If you are not familiar with either Pig or Hive, essentially Pig has its own high-level language whereas Hive has a derivative of standard SQL called HiveQL (and if you already know SQL there is not much difference, at least from what I’ve seen so far).

I was slightly shocked (as well as pleased) at how much of a difference there is in expressiveness and understandability between the two. Here’s an example of analysing some logs to find out the top 5 contributors to cache misses by User Agent, using Pig:

-- Keep only the cache misses, then count them per user agent.
cache_misses = FILTER raw_logs BY sc_status MATCHES 'TCP_MISS.*';
cache_misses_by_ua = GROUP cache_misses BY c_user_agent;
cache_miss_count_by_ua = FOREACH cache_misses_by_ua {
    GENERATE group, COUNT(cache_misses) AS cnt;
}
-- Order by the count, descending, and keep the top 5.
ord = ORDER cache_miss_count_by_ua BY cnt DESC;
top5 = LIMIT ord 5;
DUMP top5;


This is actually a simplified version of what I had previously, which involved nested blocks, but that only serves to further illustrate my point. I find it very hard to wrap my head around nested blocks and how to use them to get the data out the way I want (usually involving ordering and limiting). Maybe my Pig fragment here could be described more concisely, but I doubt I would understand it in any fewer lines.

For comparison, here is what I wrote to get the equivalent data out of Hive:

SELECT COUNT(*) AS cnt, c_user_agent FROM raw_logs WHERE sc_status LIKE 'TCP_MISS%' GROUP BY c_user_agent ORDER BY cnt DESC LIMIT 5;

Pretty straightforward, right? The Hive query above and the Pig script fragment both produce exactly the same result set (ignoring minor formatting differences). Both Hive and Pig require approximately the same number of lines to set up the log parsing, mostly because it involves setting up each field label and data type individually, and then a regex to parse the fields out of the input files. If you have a deserializer UDF, this is made much easier in either case.
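To make that concrete, here is a rough sketch of the kind of Hive table definition I mean, using Hive’s built-in RegexSerDe – the column names and the regex are simplified placeholders, not my actual log format:

-- Hypothetical table over raw log files; each capture group in
-- input.regex maps to one column, in order (RegexSerDe columns
-- must all be STRING).
CREATE EXTERNAL TABLE raw_logs (
  sc_status    STRING,
  c_user_agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '(\\S+)\\s+(.*)'
)
STORED AS TEXTFILE
LOCATION '/logs/raw';

The real definition has one column per log field, which is where the line count (and the fiddly regex) comes from.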

It appears I may have to knuckle down and write a deserializer in Java for at least one format that some of the logs I use are stored in, but outside of that I feel I would be much more productive in Hive. I can only recommend trying it out if you want to get your hands dirty with Map/Reduce on Hadoop and don’t want to dive down to the murky Java depths of the system. Where Pig made my life harder, it seems Hive has the potential to make it vastly easier again.
