Clojure and Large Result Sets
I’ve been working with the very useful clojure.contrib.sql package. One wrinkle I’ve encountered, in an otherwise lazy language, is that SQL connections typically are not lazy. By default they load the entire result set at once – even if the java.sql.ResultSet is lazily accessed. This isn’t so much an issue with clojure.contrib.sql as it is with how the JDBC classes are being used.
For many use cases – fetching a handful of rows, a user profile, etc. – this is a perfectly reasonable default. When you’re working with result sets that are larger than will fit into memory, this default behavior isn’t so desirable.
LIMIT using an Offset
I like the interface that the contrib library offers – a lazy sequence of the result set. Since PostgreSQL supports ranges as part of its SQL (via LIMIT), it would seem that you could just write up a quick chunking function that pulls ranges of records from the table using a lazy approach, while still presenting things as a single stream of records from a single result set.
The first function, get-record-range, simply takes a table name, a starting offset, and a record count. It executes a select for the given range and offset. The second function uses the first to create a lazy sequence, generating the next block of records when the previous block is exhausted. It includes a check to ensure it terminates when there are no more records.
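The original code was not preserved here, but based on the description, a sketch might look like the following (the function names match the text; the db/exec-sql helper is described just below, and the exact query-building details are assumptions):

```clojure
;; Sketch of the two functions described above. `db/exec-sql` is the
;; convenience wrapper discussed below; the SQL string construction
;; here is an assumption.
(defn get-record-range
  "Select `block-size` records from `table`, starting at `offset`."
  [table offset block-size]
  (db/exec-sql
   [(str "SELECT * FROM " (name table) " LIMIT ? OFFSET ?")
    block-size offset]))

(defn record-seq
  "Lazily walk `table` in blocks of `block-size` records,
  presenting them as a single sequence."
  ([table block-size]
     (record-seq table block-size 0))
  ([table block-size offset]
     (lazy-seq
      (let [block (get-record-range table offset block-size)]
        ;; terminate when there are no more records
        (when (seq block)
          (concat block
                  (record-seq table block-size (+ offset block-size))))))))
```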
The only piece missing from the above is our db/exec-sql, which is just a convenience wrapper around clojure.contrib.sql/with-query-results that forces the lazy record sequence (this was not the source of our memory troubles). db/exec-sql is something I often use to quickly grab the entirety of a result set.
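That wrapper is small; a sketch of it, assuming the standard with-query-results calling convention, might be:

```clojure
(require '[clojure.contrib.sql :as sql])

;; Sketch of the exec-sql convenience wrapper described above;
;; the exact signature is an assumption.
(defn exec-sql
  "Run a [sql & params] vector and return the whole result set.
  `doall` forces the lazy sequence before the connection closes."
  [sql-params]
  (sql/with-query-results results sql-params
    (doall results)))
```

Note that the `doall` is exactly what makes this eager: the whole result set is realized in memory, which is fine for small queries and is why it was not itself the source of the memory troubles.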
The problem with this approach, though, is that using a SELECT * and a LIMIT without an ORDER BY clause doesn’t guarantee the ordering of the records. The database (PostgreSQL, for example) is free to return rows in whatever order it pleases. The impact of this is that the rows you get back will not be consistent – you are not guaranteed to get all of the rows through this approach, nor are you guaranteed not to get duplicate rows! For a table where we tested this approach, with approximately 600k records, we got about 400k unique rows back.
Adding an ORDER BY id (since most of our tables contain an id column) resolves these issues, only to incur a large performance penalty while the database computes an ordered list of ids so that it can return the requested block of records.
Back to the drawing board: Try Database Cursors
Databases have long had a way of supporting this use case via cursors. JDBC, and specifically the ResultSet interface, was designed to support them.
It turns out that the default settings on the java.sql.Connection configured by clojure.contrib.sql are not set to use a cursor. According to ‘PostgreSQL JDBC Driver: Getting results based on a cursor’, there are several settings that need to be in place for the result set to use a cursor:
- Auto-commit must be turned off.
- The java.sql.ResultSet must be created with a type of ResultSet.TYPE_FORWARD_ONLY (the default).
- The query must be a single statement, meaning it must not contain semi-colons joining together multiple statements.
- You must use a java.sql.Statement and set the FetchSize on the statement to a non-zero value.
It turns out that clojure.contrib.sql has a handy form that will turn off AutoCommit and ensure it is set back to its previous value, so you don’t put the Connection into an inconsistent state (and so that you join any transactions that are in progress). That form is transaction. Trying it out:
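The experiment might look like the following sketch (`db-spec` is a hypothetical connection map; getAutoCommit is a standard java.sql.Connection method, and sql/connection returns the current connection inside with-connection):

```clojure
(require '[clojure.contrib.sql :as sql])

;; Inspect auto-commit inside and outside of `transaction`.
(sql/with-connection db-spec
  (println "AutoCommit before:"
           (.getAutoCommit (sql/connection)))
  (sql/transaction
   (println "AutoCommit inside:"
            (.getAutoCommit (sql/connection)))))
```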
The output shows that AutoCommit is indeed off, but we’ll still need to set a fetch size. The default ResultSet type is ResultSet.TYPE_FORWARD_ONLY, which is what we need. Adding in setFetchSize is straightforward, so, wrapping it all up into a helper function (following the conventions in the sql package) I end up with:
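The original helper was not preserved here; a sketch of what such a macro might look like, following the with-query-results convention (the macro name, the fetch size of 1000, and the parameter-binding details are all assumptions):

```clojure
(require '[clojure.contrib.sql :as sql])

;; Sketch of a cursor-backed query helper. Runs inside a transaction
;; (auto-commit off) with a forward-only statement and a non-zero
;; fetch size, so the PostgreSQL driver streams rows via a cursor.
(defmacro with-cursor-results
  "Like with-query-results, but streams the result set lazily
  through a database cursor instead of loading it all at once."
  [results sql-params & body]
  `(sql/transaction
    (with-open [stmt# (.prepareStatement
                       (sql/connection)
                       (first ~sql-params)
                       java.sql.ResultSet/TYPE_FORWARD_ONLY
                       java.sql.ResultSet/CONCUR_READ_ONLY)]
      (.setFetchSize stmt# 1000)          ;; pull 1000 rows at a time
      ;; bind any query parameters positionally
      (doseq [[ix# param#] (map vector (iterate inc 1)
                                (rest ~sql-params))]
        (.setObject stmt# ix# param#))
      (with-open [rs# (.executeQuery stmt#)]
        (let [~results (resultset-seq rs#)]
          ~@body)))))
```

As with with-query-results, the bound sequence is only valid inside the body, since the cursor lives and dies with the enclosing transaction.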
This brings me back to my goal: being able to wield very large result sets.