How Random Sampling in Hive Works, And How to Use It

Image Courtesy: https://faculty.elgin.edu/dkernler/statistics/ch01/images/srs.gif

Random sampling is a technique in which each sample has an equal probability of being chosen. A sample chosen randomly is meant to be an unbiased representation of the total population.

In the big data world, we have an enormous total population: a population that can prove tricky to truly sample randomly. Thankfully, Hive has a few tools for realizing the dream of random sampling in the data lake. Continue reading

Advertisements

Hive Tables, Internal and External, Explained

hive-logo

It’s time to break down what they mean, how to use them, and how to get the best of both worlds. Continue reading