Clojure and Large Result Sets
I’ve been working with the very useful clojure.contrib.sql package. One wrinkle I’ve encountered, in an otherwise lazy language, is that the SQL connections typically are not lazy. By default they load the entire result set at once – even if the java.sql.ResultSet is lazily accessed. This isn’t so much an issue with clojure.contrib.sql itself as it is with how the JDBC classes are being used.
For many use cases (fetching a handful of rows, a user profile, etc.) this is a perfectly reasonable default. When you’re working with result sets that are larger than will fit into memory, though, this default behavior isn’t so desirable.
First Attempt: LIMIT using an OFFSET
I like the interface that the contrib library offers – a lazy sequence of the result set. Since PostgreSQL supports ranges as part of its SQL (via LIMIT and OFFSET), it would seem that you could write up a quick chunking function that pulls ranges of records from the table lazily, while still presenting things as a single stream of records from a single result set.
The first function, get-record-range, simply takes a table name, a starting offset, and a record count, and executes a select for the given range and offset. The second function uses the first to create a lazy sequence, generating the next block of records when the previous block is exhausted; it includes a check to ensure it terminates when there are no more records.
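The original code isn’t reproduced here, but a sketch of those two functions (using the db/exec-sql helper described just below; the name record-seq is my own) might look like this:

```clojure
(defn get-record-range
  "Fetch `chunk-size` records from `table`, starting at `offset`."
  [table offset chunk-size]
  (db/exec-sql
    (format "SELECT * FROM %s LIMIT %d OFFSET %d" table chunk-size offset)))

(defn record-seq
  "A lazy seq over all records in `table`, fetched in blocks of `chunk-size`.
   Terminates when a block comes back empty."
  ([table chunk-size] (record-seq table chunk-size 0))
  ([table chunk-size offset]
     (lazy-seq
       (let [block (get-record-range table offset chunk-size)]
         (when (seq block)
           (concat block
                   (record-seq table chunk-size (+ offset chunk-size))))))))
```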
The only piece missing from the above is our db/exec-sql, which is just a convenience wrapper around clojure.contrib.sql/with-query-results that forces the lazy record sequence (this was not the source of our memory troubles). db/exec-sql is something I often use to quickly grab the entirety of a result set.
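db/exec-sql isn’t shown here either; a minimal version, with placeholder connection details, could be:

```clojure
;; A minimal stand-in for the db namespace referenced above.
(ns db
  (:require [clojure.contrib.sql :as sql]))

;; Placeholder connection settings; substitute your own.
(def db-spec
  {:classname   "org.postgresql.Driver"
   :subprotocol "postgresql"
   :subname     "//localhost/mydb"})

(defn exec-sql
  "Run `query`, forcing the (otherwise lazy) result seq with doall so the
   rows are still usable after with-query-results unwinds."
  [query]
  (sql/with-connection db-spec
    (sql/with-query-results rows [query]
      (doall rows))))
```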
The problem with this approach, though, is that using a SELECT * with a LIMIT but no ORDER BY clause doesn’t guarantee the ordering of the records. The database (PostgreSQL, for example) is free to return rows in whatever order it pleases. The impact of this is that the rows you get back will not be consistent – you are not guaranteed to get all of the rows through this approach, nor are you guaranteed not to get duplicate rows! For a table where we tested this approach, with approximately 600k records, we got about 400k unique rows back.
Adding an ORDER BY id (since most of our tables contain an id column) resolves these issues, only to incur a large performance penalty while the database computes an ordered list of ids so that it can return the requested block of records.
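In terms of the sketch above, the fix is a small change to the query string, something like:

```clojure
;; Same chunked query, but with a stable ordering so LIMIT/OFFSET windows
;; neither overlap nor skip rows.
(format "SELECT * FROM %s ORDER BY id LIMIT %d OFFSET %d"
        table chunk-size offset)
```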
Back to the drawing board: Try Database Cursors
Databases have long had a way of supporting this use case via cursors. JDBC, and specifically the ResultSet interface, was designed to support them.
It turns out that the default settings on the java.sql.Connection configured by clojure.contrib.sql are not set up to use a cursor. According to ‘PostgreSQL JDBC Driver: Getting results based on a cursor’, there are several settings that need to be in place for the result set to use a cursor:
- Auto Commit must be turned off.
- The java.sql.ResultSet must be created with a ResultSet type of ResultSet.TYPE_FORWARD_ONLY.
- The query must be a single statement, meaning it must not contain semi-colons joining together multiple statements.
- You must use a java.sql.Statement and set the FetchSize on the statement to a non-zero value.
Fortunately, clojure.contrib.sql has a handy form that will turn off AutoCommit and ensure it is set back to its previous value, so you don’t put the Connection into an inconsistent state (and so that you join any transactions that are in progress). That form is transaction, and trying it out:
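(The snippet below is a reconstruction rather than the original post’s code; db-spec stands in for whatever connection map you normally use.)

```clojure
(require '[clojure.contrib.sql :as sql])

;; Print the connection's auto-commit flag outside and inside `transaction`.
(sql/with-connection db-spec
  (println "auto-commit outside transaction:"
           (.getAutoCommit (sql/connection)))
  (sql/transaction
    (println "auto-commit inside transaction:"
             (.getAutoCommit (sql/connection)))))
```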
The output shows that AutoCommit is indeed off, but we’ll still need to set a FetchSize ourselves.
The default ResultSet type is ResultSet.TYPE_FORWARD_ONLY, which is what we needed. Adding in a call to setFetchSize is straightforward, so, wrapping it all up into a helper function (following the conventions in the sql package), I end up with:
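(Again, what follows is a sketch along the lines described rather than the original helper; the name with-query-results-cursor and its fetch-size argument are my own choices.)

```clojure
(require '[clojure.contrib.sql :as sql])

(defn with-query-results-cursor*
  "Run `sql-str` with `params` on a TYPE_FORWARD_ONLY statement that has a
   non-zero fetch size, inside a transaction (so auto-commit is off), and
   call `func` with the lazy seq of rows streamed from the cursor."
  [fetch-size [sql-str & params] func]
  (sql/transaction
    (with-open [stmt (.prepareStatement (sql/connection) sql-str
                                        java.sql.ResultSet/TYPE_FORWARD_ONLY
                                        java.sql.ResultSet/CONCUR_READ_ONLY)]
      (.setFetchSize stmt fetch-size)
      ;; Bind any query parameters positionally.
      (doseq [[idx param] (map-indexed vector params)]
        (.setObject stmt (inc idx) param))
      (with-open [rs (.executeQuery stmt)]
        (func (resultset-seq rs))))))

(defmacro with-query-results-cursor
  "Like clojure.contrib.sql/with-query-results, but streams rows through a
   database cursor instead of loading the whole result set."
  [results fetch-size sql-params & body]
  `(with-query-results-cursor* ~fetch-size ~sql-params (fn [~results] ~@body)))
```

Used inside with-connection, something like (with-query-results-cursor rows 500 ["SELECT * FROM big_table"] (doseq [row rows] …)) should then walk the table a few hundred rows at a time rather than realizing everything up front.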
This brings me back to my goal: being able to wield very large result sets.