Spark History Server Automatic Cleanup

I wonder how much paper you’d need to print 1.5 TB of logs…

If you’ve been running Spark applications for a few months, you might start to notice some odd behavior from the history server (default port 18080). Specifically, the page takes forever to load, shows links to applications that no longer exist, or even crashes. Three parameters take care of this once and for all.
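The teaser doesn’t name the three parameters, so treat the following as a sketch under the assumption that it means Spark’s event log cleaner settings. These property names are standard history server options, usually set in spark-defaults.conf on the history server host; the values shown are illustrative, not recommendations from the post.

```
# Assumed to be the three parameters the post refers to.
# Turn on periodic deletion of old event logs (off by default).
spark.history.fs.cleaner.enabled   true
# How often the cleaner checks for logs to delete (Spark's default is 1d).
spark.history.fs.cleaner.interval  1d
# Event logs older than this are deleted when the cleaner runs (Spark's default is 7d).
spark.history.fs.cleaner.maxAge    7d
```

The history server reads these at startup, so it likely needs a restart before the cleaner takes effect.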

Apache Crunch Tutorial #4: Distincts, Materialization, and Objects

This post is the fourth in a hopefully substantive and informative series about Apache Crunch, a framework that makes it easier for Java developers to write MapReduce programs for Hadoop.
