Sampling a Sequence with Clojure
We needed to sample a data set with around 392 million entries. The first approach we thought of was using the database. The second was creating an array of the record IDs (or line numbers), shuffling the array, and then selecting the records that matched the chosen line numbers or record IDs.
SQL
SQL databases do support selecting random samples of records. Looking at SQL to Select a random row from a database table, most approaches use a variation of the following:
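(PostgreSQL syntax; the table name and sample size below are placeholders.)

```sql
-- ORDER BY random() forces the database to generate a random value for
-- every row and sort the whole set before the LIMIT can be applied.
SELECT *
FROM big_table        -- placeholder table name
ORDER BY random()
LIMIT 100000;         -- placeholder sample size
```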
We asked PostgreSQL how expensive this would be on our data set and… it reported a cost in the hundreds of millions. We weren't confident it would execute in any reasonable time frame, given our (limited) understanding of what was required: generate a random value for each row, then sort the entire set of random values before the sample could be taken (the final LIMIT would admittedly be fast once the sort completed). Testing on a million rows, the database returned results only after some time, and the thought of scaling that up by more than two orders of magnitude left us pessimistic.
We kicked off a query using this approach anyway, and while it ran we started discussing and googling for other approaches.
Shuffle
If the database was going to be slow at it, would it be any faster if we took the same approach by hand? Could we do it in memory in the JVM?
With a 2048M heap, we tried allocating a long array with enough space to hold 392 million entries and got an OutOfMemoryError. Going up to a 4096M heap and down to an array of ints, we were able to allocate the array; populating it with sequential line numbers in Clojure took about 80 seconds on the box we were working with.
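That allocation step, roughly (the record count is the one described above; the code itself is a sketch rather than our original):

```clojure
;; Sketch: allocate a primitive int array with one slot per record and
;; fill it with sequential line numbers.  Needs a large heap (-Xmx4096m).
(def num-records 392000000)

(defn sequential-line-numbers []
  (let [arr (int-array num-records)]
    (dotimes [idx num-records]
      (aset-int arr idx idx))
    arr))
```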
There is a Clojure function shuffle, which under the hood defers to java.util.Collections/shuffle, an optimized shuffle supplied with the JRE. This version of shuffle would have required us to give up primitive arrays and box everything in the wrapper classes – at these sizes, that looked like a non-starter.
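For reference, clojure.core's shuffle is (approximately) just the following, which is why every element would have had to be boxed into an object collection first:

```clojure
(defn shuffle
  "Return a random permutation of coll"
  [^java.util.Collection coll]
  (let [al (java.util.ArrayList. coll)]
    (java.util.Collections/shuffle al)
    (clojure.lang.RT/vector (.toArray al))))
```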
A pure Java implementation of the array creation and shuffling (Fisher-Yates / Knuth shuffle) was pretty easy to create.
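A sketch of what it looked like (the class and method names here are placeholders rather than our original source):

```java
import java.util.Random;

public class IntShuffle {

    // Fill an int[] with sequential line numbers, then shuffle it in
    // place using the Fisher-Yates / Knuth algorithm.
    public static int[] shuffledLineNumbers(int size) {
        int[] arr = new int[size];
        for (int i = 0; i < size; i++) {
            arr[i] = i;
        }
        Random rnd = new Random();
        for (int i = size - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);  // 0 <= j <= i
            int tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
        }
        return arr;
    }
}
```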
And it ran pretty quickly as well, at just over a minute (78s). That still left us having to use the result of that operation as a lookup set in order to filter the matching records out of the original data set.
We kept wondering whether there wasn't a way to do it in a single pass, so we kept googling. Then we ran across an article describing an alternative approach that would allow the set to be sampled while it was streamed in a single run: How to pick a random sample from a list.
Based on that article we were able to craft an implementation in Clojure that can operate on any sequence, randomly selecting records from a given prefix of the sequence to hit a target sample size.
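The version that ships in clj-etl-utils differs in its details; as a sketch of the idea (the names below are our own, not the library's API), each element is kept with probability (still needed) / (still remaining):

```clojure
(defn selection-sample
  "Lazily selects sample-size elements from the first population-size
   elements of coll.  If population-size overestimates the real record
   count, the sample can come up short."
  [coll population-size sample-size]
  (letfn [(step [s remaining needed]
            (lazy-seq
             (when (and (seq s) (pos? needed) (pos? remaining))
               (if (< (rand) (/ needed remaining))
                 (cons (first s)
                       (step (rest s) (dec remaining) (dec needed)))
                 (step (rest s) (dec remaining) needed)))))]
    (step coll population-size sample-size)))
```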
We ran it on our 19 GB file (the 390 million record file) and it finished in under ten minutes! That didn't feel much longer than simply streaming the file through Java's IO. We ended up implementing make-periodic-invoker as a 'throbber' so we'd have a progress update – thanks to the way this approach works, it could estimate the amount completed (and could have estimated the total and time remaining).
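Pulled together, a run over the file looked roughly like the following sketch (the file names, counts, and the periodic-invoker helper standing in for make-periodic-invoker are placeholders):

```clojure
(require '[clojure.java.io :as io])

;; Placeholder stand-in for make-periodic-invoker: returns a fn that
;; calls f (passing the running count) once every period invocations.
(defn periodic-invoker [period f]
  (let [counter (atom 0)]
    (fn []
      (let [n (swap! counter inc)]
        (when (zero? (mod n period))
          (f n))))))

(defn sample-file [in-path out-path population-size sample-size]
  (let [throb (periodic-invoker 100000
                                (fn [n] (println "kept" n "records so far")))]
    (with-open [rdr (io/reader in-path)
                wtr (io/writer out-path)]
      (doseq [line (selection-sample (line-seq rdr) population-size sample-size)]
        (throb)
        (.write wtr (str line "\n"))))))

;; e.g. (sample-file "records.tsv" "sample.tsv" 390000000 1000000)
```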
The algorithm isn't guaranteed to produce the full target sample, though it is very likely to. As it gets closer to the number of items you said the sample would be drawn from (the population-size), it becomes more and more likely to select elements, finally reaching a selection probability of 1 at the last element of the population if it hasn't completed the sample by that point.
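As a concrete illustration of that probability (numbers purely illustrative):

```clojure
;; Probability of keeping the next record = (still needed) / (still remaining).
(defn keep-probability [needed remaining]
  (/ needed remaining))

(keep-probability 100000 392000000) ;=> 1/3920, about 0.000255 near the start
(keep-probability 3 3)              ;=> 1 when only just enough records remain
```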
We did a quick analysis of the distribution of the sampled values and things looked great. We're happy to have the new technique and the new library code; you can leverage it as part of the clj-etl-utils project.