Today we’ll briefly showcase how to join a static dataset in Spark with a streaming “live” dataset, otherwise known as a DStream. This is helpful in a number of scenarios: like when you have a live stream of data from Kafka (or RabbitMQ, Flink, etc) that you want to join with tabular data you queried from a database (or a Hive table, or a file, etc), or anything you can normally consume into Spark. Continue reading →
I’ve managed Hadoop clusters for just a little while now and I’ve discovered the user management factor of Ambari is a little rough around the edges. Specifically, there’s no easy way to manage Ambari LDAP users from within Ambari despite LDAP being a very popular way to provision and manage user access.
There is the command ambari-server sync-ldap [--users user.csv | --groups groups.csv] for adding users or groups but that can be an issue if access to the ambari user or server is limited. Additionally, the command line utility doesn’t innately have any control over HDFS directories (either creating or deleting) upon a user- or group-sync, creating extra steps in the user creation process.
ORC, or Optimized Row Columnar, is a popular big data file storage format. Its rise in popularity is due to it being highly performant, very compressible, and progressively more supported by top-level Apache products, like Hive, Crunch, Cascading, Spark, and more.
I recently wanted/needed to write ORC files from my Spark pipelines, and found specific documentation lacking. So, here’s a way to do it. Continue reading →
Apache Spark alone, by default, generates a lot of information in its logs. Spark Streaming creates a metric ton more (in fairness, there’s a lot going on). So, how do we lower that gargantuan wall of text to something more manageable?
One way is to lower the log level for the Spark Context, which is retrieved from the Streaming Context. Simply:
I’ve been working with Hadoop, Map-Reduce and other “scalable” frameworks for a little over 3 years now. One of the latest and greatest innovations in our open source space has been Apache Spark, a parallel processing framework that’s built on the paradigm Map-Reduce introduced, but packed with enhancements, improvements, optimizations and features. You probably know about Spark, so I don’t need to give you the whole pitch.
You’re likely also aware of its main components:
Spark Core: the parallel processing engine written in the Scala programming language
Spark SQL: allows you to programmatically use SQL in a Spark pipeline for data manipulation
Spark MLlib: machine learning algorithms ported to Spark for easy use by devs
Spark GraphX: a graphing library built on the Spark Core engine
Spark Streaming: a framework for handling data that is live-streaming at high speed
Spark Streaming is what I’ve been working on lately. Specifically, building apps in Scala that utilize Spark Streaming to stream data from Kafka topics, do some on-the-fly manipulations and joins in memory, and write newly augmented data in “near real time” to HDFS.
As you progress in your big data journey with Hadoop, you may find that your datanodes’ hard drives are gradually getting more and more full. A tempting thing to do is simply plug in more hard drives to your servers: you’ve got extra slots on your racks and adding entirely new nodes is an expensive (and a little tedious) task. This is particularly relevant when hard drives start failing on your data nodes.
Unless you want to spend a long time fixing your cluster’s data distribution, I urge you,
One of the key benefits of Hadoop is its capacity for storing large quantities of data. With HDFS (the Hadoop Distributed File System), Hadoop clusters are capable of reliably storing petabytes of your data.
A popular usage of that immense storage capability is storing and building history for your datasets. You can not only utilize it to store years of data you might currently be deleting, but you can also build on that history! And, you can structure the data within a Hadoop-native tool like Hive and give analysts SQL-querying ability to that mountain of data! And it’s pretty cheap!
…And the Pitch!
In this tutorial, we’ll walk through why this is beneficial, and how we can implement it on a technical level in Hadoop. Something for the business guy, something for the developer tasked with making the dream come true.
The point of Hadoopsters is to teach concepts related to big data, Hadoop, and analytics. To some, this article will be too simple — low hanging fruit for the accomplished dev. This article is not necessarily for you, captain know-it-all — it’s for someone looking for a reasonably worded, thoughtfully explained how-to on building data history in native Hadoop. We hope to accomplish that here.
If you’ve been running Spark applications for a few months, you might start to notice some odd behavior with the history server (default port 18080). Specifically, it’ll take forever to load the page, show links to applications that don’t exist or even crash. Three parameters take care of this once and for all. Continue reading →