Spark History Server Automatic Cleanup

I wonder how much paper you’d need to print 1.5 TB of logs…

If you’ve been running Spark applications for a few months, you might start to notice some odd behavior from the history server (default port 18080). Specifically, it’ll take forever to load the page, show links to applications that no longer exist, or even crash. Three parameters take care of this once and for all.

This is because, by default, Spark keeps its history logs indefinitely. That can be a good thing or a bad thing, obviously:

Good:

  • You have a record of all your spark apps from day one
  • You can track your progress as a super DUPER awesome Spark developer

Bad:

  • Depending on the verbosity level, these logs can get VERY large. I mean VERY large:

    [hdfs@cluster1 ~]$ hadoop fs -du -h /
    6.8 G   /app-logs
    1.3 G   /mr-history
    1.5 T   /spark-history
    25.0 M  /tmp
    20.2 G  /user
  • Remember, with HDFS’s default replication factor of 3, that’s an actual on-disk size of 4.5 TERABYTES of mostly INFO messages telling you that various executors are starting or stopping. Not very useful.

Set these parameters in your “Custom spark-defaults” config section in Ambari (or directly in your spark-defaults.conf file without Ambari) to take care of these massive logs:

1. spark.history.fs.cleaner.enabled=true

2. spark.history.fs.cleaner.interval=1d

3. spark.history.fs.cleaner.maxAge=7d

#1 enables the history cleaner, #2 sets the check-interval (every day, in this case), and #3 sets the maximum age of any log (7 days, in this case). Anything older than 7 days will be automatically deleted.
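For reference, here’s what those three settings look like pasted straight into spark-defaults.conf (the one-day interval and seven-day retention are just the values used in this post; tune them to your needs):

```
spark.history.fs.cleaner.enabled=true
spark.history.fs.cleaner.interval=1d
spark.history.fs.cleaner.maxAge=7d
```

These are read by the history server at startup, so you’ll likely need to restart it before the cleaner kicks in.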

See the documentation here for more details:

https://spark.apache.org/docs/1.6.2/monitoring.html#viewing-after-the-fact

In my experience, this can take anywhere from a few hours to a whole day to actually take effect, but it does work! 45.4 GB is much better than 1.5 TB.

[hdfs@cluster1 ~]$ hadoop fs -du -h /
6.8 G   /app-logs
1.3 G   /mr-history
45.4 G  /spark-history
25.0 M  /tmp
20.2 G  /user

After some Googling, one estimate puts it at ~675,000 sheets of paper to print 1 GB of text, which means printing 1.5 TB of logs would take about 1,012,500,000 sheets. At roughly 5 grams per sheet, that’s around 5,000 metric tons of paper, or about a tenth of the weight of the Titanic. Let’s stick with hard drives.
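The back-of-the-envelope math, assuming ~675,000 sheets per GB of printed text and a standard ~5-gram sheet of office paper (both rough assumptions, not measurements):

```shell
sheets_per_gb=675000                       # assumed sheets to print 1 GB of text
gb=1500                                    # 1.5 TB = 1500 GB
sheets=$((sheets_per_gb * gb))             # total sheets
tons=$((sheets * 5 / 1000000))             # 5 g/sheet, grams -> metric tons
echo "$sheets sheets, ~$tons metric tons"  # -> 1012500000 sheets, ~5062 metric tons
```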
