How Random Sampling in Hive Works, And How to Use It

Image Courtesy: https://faculty.elgin.edu/dkernler/statistics/ch01/images/srs.gif

Random sampling is a technique in which each member of a population has an equal probability of being chosen. A randomly chosen sample is meant to be an unbiased representation of the total population.

In the big data world, we have an enormous total population, one that can be tricky to sample in a truly random way. Thankfully, Hive has a few tools for realizing the dream of random sampling in the data lake.
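
To make that concrete, here is a rough sketch of three common ways to draw a random sample in HiveQL. The excerpt above does not say which tools it has in mind, so these are an assumption about what they might be. The table name `events` and the sampling rates are hypothetical and used only for illustration.

```sql
-- Assumption: a hypothetical table named `events`.

-- 1. Bucket sampling on rand(): keeps roughly 1 in 100 rows.
SELECT *
FROM events TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s;

-- 2. Per-row filter: each row is kept with probability of about 0.01,
--    so the sample size varies from run to run.
SELECT *
FROM events
WHERE rand() <= 0.01;

-- 3. Fixed-size sample: shuffle rows across reducers, sort each
--    reducer's rows randomly, then keep 1000 rows.
SELECT *
FROM events
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 1000;
```

Because `rand()` is non-deterministic, each of these queries will generally return different rows on every run. The first two yield an approximate fraction of the table, while the third yields a fixed number of rows.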
