
2019-02-14 / SQL in/any: consider using an array in (select unnest(?)) instead of issuing statements in a batch or building dynamic statements

Problem statement

Imagine you have a set of n identifiers (IDs) that can't be represented by a range, and you want to delete (or update, or whatever) all rows containing those IDs from your relational database. How would you do this? What if n is huge?

Solutions

n static statements with one bind variable in a batch

You could issue n SQL commands (a.k.a. statements) with a bind variable:

delete from my_table where id = ?

Of course, you would not want to issue the commands one by one; you would organize them into a batch. With JDBC, this is done via PreparedStatement.addBatch / Statement.executeBatch / Statement.executeLargeBatch. But even though a batch transmits the commands more efficiently, you still issue n commands, which the DBMS executes one by one. It is reasonable to assume that executing n commands takes more time than executing a single command that does the same thing, and this appears to be true according to "The Performance Difference Between SQL Row-by-row Updating, Batch Updating, and Bulk Updating" comparison done by jOOQ.
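
For reference, here is a minimal sketch of this batched approach (the method name is mine; assumes java.sql.Connection, java.sql.PreparedStatement, java.sql.SQLException and java.util.Set are imported):

// One SQL string, n queued executions: the driver transmits the commands together,
// but the DBMS still executes n separate commands.
static int[] deleteByIds(Connection connection, Set<Long> ids) throws SQLException {
    try (PreparedStatement statement = connection.prepareStatement(
            "delete from my_table where id = ?")) {
        for (Long id : ids) {
            statement.setLong(1, id);
            statement.addBatch(); // queue the command instead of executing it immediately
        }
        return statement.executeBatch(); // send all queued commands to the DBMS
    }
}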

One dynamic statement with n bind variables

So now you may want to dynamically build a single SQL command with n bind variables specified for the in comparison:

delete from my_table where id in (?, ?, ...)

However, for a large enough n we may face a limit imposed by a (poor?) implementation of a JDBC driver, like the one described here:

java.io.IOException: Tried to send an out-of-range integer as a 2-byte value: 100000
    at org.postgresql.core.PGStream.SendInteger2(PGStream.java:201)

which happened when 100 000 values were specified for the in comparison. Another problem with this approach is that we may end up generating many similar SQL statements which differ only in the number of bind variables. If there is a cache of execution plans in the DBMS, or a cache of prepared statements in the JDBC API implementation / JDBC driver, then not only do we not benefit much from those caches, but we also pollute them (that is why Hibernate ORM uses in parameter padding; see also hibernate.query.in_clause_parameter_padding).
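
For illustration, here is a minimal sketch (mine, not from the linked discussion) of how such a statement is typically built, assuming additionally java.util.Collections and java.util.List; note that every distinct ids.size() yields a distinct SQL string, which is exactly what pollutes the caches:

// Builds "delete from my_table where id in (?, ?, ..., ?)" with ids.size() placeholders.
// Assumes ids is non-empty; "in ()" is not valid SQL.
static int deleteByIdsDynamically(Connection connection, List<Long> ids) throws SQLException {
    String placeholders = String.join(", ", Collections.nCopies(ids.size(), "?"));
    try (PreparedStatement statement = connection.prepareStatement(
            "delete from my_table where id in (" + placeholders + ")")) {
        int i = 1;
        for (Long id : ids) {
            statement.setLong(i++, id); // bind variables are 1-indexed in JDBC
        }
        return statement.executeUpdate();
    }
}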

One static statement with one bind variable of the array type of size n

Given all of the above, it appears to me that a good option may be to use the SQL array type, represented by java.sql.Array in JDBC:

delete from my_table where id in (select unnest(?))

We can create an array with Connection.createArrayOf and bind it using PreparedStatement.setArray/PreparedStatement.setObject. The reason for using the unnest function is that the in subquery expression expects (you guessed it) a subquery which returns a set of rows, not an array. Note that previously we were using the in comparison, while now we are using the in subquery expression. The function unnest converts an array to a set of rows (a.k.a. flattening).
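
In code this looks as follows (a minimal sketch for PostgreSQL; "bigint" is the PostgreSQL name of the array element type, and the method name is mine):

// The SQL string is always the same regardless of the number of IDs,
// so plan / prepared statement caches are reused instead of being polluted.
static int deleteByIdsWithArray(Connection connection, Set<Long> ids) throws SQLException {
    try (PreparedStatement statement = connection.prepareStatement(
            "delete from my_table where id in (select unnest(?))")) {
        statement.setArray(1, connection.createArrayOf("bigint", ids.toArray()));
        return statement.executeUpdate();
    }
}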

But what about the performance of the in comparison with multiple bind variables versus the in subquery expression with a single bind variable of the array type? I am glad that jOOQ has again already compared them: "SQL IN Predicate: With IN List or With Array? Which is Faster?". In short:
  • for PostgreSQL, multiple bind variables seem to result in a smaller latency than arrays for n < 128, and the variant with an array results in a smaller latency for n >= 128;
  • for Oracle, the variant with an array is at least not worse than the variant with multiple bind variables if we help it to determine the array cardinality with a hint.
So there is no simple answer regarding performance, but it looks like at least for big enough n the variant with an array results in smaller latencies, and it also has the other aforementioned benefits. By the way, Hibernate ORM may also get this capability in the future.

JDBC Example

This technique turns out to be especially handy when you have multiple sets of IDs and want to request different updates for each set. It allows you to have a single SQL command for each set of IDs and to issue all such commands in a batch. Here is example code for such a situation ([DBMS] PostgreSQL, [JDBC driver] PostgreSQL JDBC Driver):

Map<String, Set<Long>> valueToIds = ...;
JdbcTemplate jdbcTemplate = ...;
jdbcTemplate.execute((Connection connection) -> {
    try (PreparedStatement statement = connection.prepareStatement(
            "update my_table set value = ? where id in (select unnest(?))")) {
        valueToIds.forEach((value, ids) -> {
            try {
                statement.setString(1, value);
                statement.setArray(2, connection.createArrayOf("bigint", ids.toArray()));
                // queue one command per set of IDs
                statement.addBatch();
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        });
        // issue all queued commands as a single batch
        return statement.executeBatch();
    }
});

A temporary table with n rows

A radically different and significantly less convenient approach is to create a temporary table containing all n IDs and then replace in with a join by utilizing the PostgreSQL-specific using clause:

delete from my_table my using tmp_table tmp where my.id = tmp.id;

or with ANSI SQL syntax:

delete from my_table my where my.id in (select id from tmp_table);

This technique is also described, for example, here. The only justification for it that I can think of is this remark from the PostgreSQL docs: "In some cases, the join style is easier to write or faster to execute than the sub-select style".
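
For completeness, here is a minimal sketch of this approach (PostgreSQL again; the table definition and the on commit drop clause are my choices, not prescribed anywhere; assumes additionally java.sql.Statement):

// Create the temporary table, batch-insert the IDs, then delete via a join.
static int deleteByIdsViaTempTable(Connection connection, Set<Long> ids) throws SQLException {
    connection.setAutoCommit(false); // everything below runs in one transaction
    try (Statement ddl = connection.createStatement()) {
        // "on commit drop" removes the table automatically when the transaction ends
        ddl.execute("create temporary table tmp_table (id bigint) on commit drop");
    }
    try (PreparedStatement insert = connection.prepareStatement(
            "insert into tmp_table (id) values (?)")) {
        for (Long id : ids) {
            insert.setLong(1, id);
            insert.addBatch();
        }
        insert.executeBatch(); // populate the temporary table in a batch
    }
    try (Statement delete = connection.createStatement()) {
        int deleted = delete.executeUpdate(
                "delete from my_table my using tmp_table tmp where my.id = tmp.id");
        connection.commit();
        return deleted;
    }
}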

Notes

There was a report of poor performance of

= any(array[v1, v2, ...])

compared to

= any(values (v1), (v2), ...)

for PostgreSQL 9.0, but it was said to have been fixed in 9.3.

The reason I am mentioning this difference between the any comparison and the any subquery expression here is that in is equivalent to = any according to the docs, and therefore the mentioned performance bug probably also affected SQL commands with in. There is a slight difference between in and = any in PostgreSQL though: although both can be either a comparison or a subquery expression, only the any comparison accepts an array and therefore can be used without unnest.
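
So in PostgreSQL the array variant can also be written without unnest (a sketch; the binding is the same as before):

// PostgreSQL-specific: the any comparison accepts an array directly, no unnest needed.
try (PreparedStatement statement = connection.prepareStatement(
        "delete from my_table where id = any(?)")) {
    statement.setArray(1, connection.createArrayOf("bigint", ids.toArray()));
    statement.executeUpdate();
}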
