Kotlin's hidden costs - Benchmarks

Post date: Jun 25, 2017 1:56:17 PM

This article has been translated into Japanese!

A series of blog posts called Exploring Kotlin’s hidden costs , written by @BladeCoder, demonstrated how certain Kotlin constructs have a hidden cost.

The actual hidden cost is normally due to the instantiation of an invisible object or the boxing/unboxing of primitive values. These costs are especially hard to see for a developer who doesn't understand how the Kotlin compiler translates such constructs to JVM bytecode.

However, just talking about hidden costs without putting some numbers on said costs makes one wonder how much they should actually worry about them. Should these costs be taken into consideration everywhere in the codebase, meaning that some Kotlin constructs should just be forbidden outright? Or should these costs only be taken into consideration in the tightest inner loops?

Even more provocatively, do these so-called costs actually result in performance penalties (given how the JVM actively optimises code at runtime, compiling it to efficient machine code based on actual usage, the answer to this question may not be as clear as it seems)?

Without putting numbers to the hidden costs, that's impossible to answer.

For that reason, I decided to write JMH benchmarks to try to quantify the actual costs of each Kotlin construct mentioned in all the 3 parts of that blog post series published so far.

Methodology and system

Some of the Kotlin constructs' costs mentioned in the blog posts can be directly compared to the equivalent Java construct. For example, the cost of Kotlin lambdas can be directly compared to the cost of Java lambdas. However, many Kotlin constructs have no Java equivalent. In those cases, instead of comparing against Java, I compare the costly construct with the improvement suggested by the author.

The code is on GitHub, so anyone can run it on their own system to see whether the numbers match on different systems (it would be very interesting to collect some results in the Reddit comments).

If you do run them, beware that the full benchmark takes a few hours to run.

All results were collected using the following system:

MacBook Pro (2.5 GHz Intel Core i7, 16 GB of RAM)

Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

The Kotlin version (1.1.3) and JMH (0.5.6) are the latest as of writing (see the pom.xml).

Update: Android ART benchmarks are available in a WillowTreeApps.com blog post!

Part 1

https://medium.com/@BladeCoder/exploring-kotlins-hidden-costs-part-1-fbb9935d9b62

Higher-order functions and Lambda expressions

In this first example, it was not very clear what cost the author was talking about. It appears that he is just referring to the cost of using Kotlin lambdas, given that Kotlin, by default, targets the Java 6 VM which doesn't have lambdas.

However, the advice given later in the post is specific to capturing lambdas, not just any lambdas, and specifically, not the one given in the example:

fun transaction(db: Database, body: (Database) -> Int): Int {
    db.beginTransaction()
    try {
        val result = body(db)
        db.setTransactionSuccessful()
        return result
    } finally {
        db.endTransaction()
    }
}

Which is used with the following syntax:

val deletedRows = transaction(db) {

it.delete("Customers", null, null)

}

The hidden cost here appears to be only the fact that a Function object might be created when the transaction function is invoked above. But as the author himself notes, that's not the case for this particular call because the lambda is not a capturing lambda, so a single Function instance is created once and reused on every invocation.
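To picture the difference between the two kinds of lambda, here is a minimal Java sketch of hand-written classes standing in for what a compiler targeting Java 6 might generate. All class and method names are hypothetical and only illustrate the idea; this is not actual compiler output.

```java
import java.util.function.ToIntFunction;

final class LambdaInstanceSketch {

    // Stand-in for a non-capturing lambda: it has no state,
    // so one shared instance can serve every invocation.
    static final class CountChars implements ToIntFunction<String> {
        static final CountChars INSTANCE = new CountChars();

        private CountChars() {}

        @Override
        public int applyAsInt(String s) {
            return s.length();
        }
    }

    // Stand-in for a capturing lambda: the captured value becomes a field,
    // so a fresh object is needed each time the lambda expression is evaluated.
    static final class CountCharsPlus implements ToIntFunction<String> {
        private final int extra;

        CountCharsPlus(int extra) {
            this.extra = extra;
        }

        @Override
        public int applyAsInt(String s) {
            return s.length() + extra;
        }
    }

    static ToIntFunction<String> nonCapturing() {
        return CountChars.INSTANCE;
    }

    static ToIntFunction<String> capturing(int extra) {
        return new CountCharsPlus(extra);
    }
}
```

Calling nonCapturing() twice yields the same object, while calling capturing(1) twice yields two distinct objects, which is the allocation the blog post is concerned about.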

The only remaining cost mentioned in the post is not a runtime cost: the extra 3 or 4 methods created by the Function class generated by the Kotlin compiler.

Anyway, I decided to check whether there's a real runtime cost related to Kotlin lambdas when compared to Java 8's lambdas, because the description of the problem left me with the impression that such a cost should exist. The cost of using a capturing lambda, which the author advises against in this part of the post, will be benchmarked in Part 2, so read on.

The equivalent function was implemented by me using Java 8:

public static int transaction( Database db, ToIntFunction<Database> body ) {
    db.beginTransaction();
    try {
        int result = body.applyAsInt( db );
        db.setTransactionSuccessful();
        return result;
    } finally {
        db.endTransaction();
    }
}

The syntax for calling this function in Java 8 is just slightly different from Kotlin:

int deletedRows = transaction( db, ( database ) ->

database.delete( "Customer", null, null ) );

And here are the results of the benchmark comparing the Kotlin version with the Java 8 version:

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part1.KotlinBenchmarkPart1.javaLambda thrpt 200 1024302.409 1851.789 ops/ms

c.a.k.part1.KotlinBenchmarkPart1.kotlinLambda thrpt 200 1362991.121 2824.862 ops/ms

In the above chart, higher is better (more ops/ms).

Notice that this example should also show the overhead of using a Kotlin lambda which returns an Integer versus the specialized version (ToIntFunction) used in the Java 8 example.

However, the Kotlin lambda seems to be significantly faster than the Java lambda. So, the cost here seems to be negative, as Kotlin actually ran around 30% faster than Java! The mean error is a little bit larger for Kotlin than for Java, but that's not even visible in the above chart, which does attempt to show the error bar (but it's just too small to be seen).
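For readers who want to check the percentage, here is a small Java sketch of the arithmetic, using the two mean throughput values from the result table above. The class and method names are made up for illustration.

```java
final class ThroughputComparison {

    /** Percentage by which the candidate throughput exceeds the baseline (negative if it is lower). */
    static double percentFaster(double candidate, double baseline) {
        return (candidate / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double javaLambda = 1024302.409;   // ops/ms, from the table above
        double kotlinLambda = 1362991.121; // ops/ms, from the table above
        System.out.printf("Kotlin vs Java: %.1f%%%n", percentFaster(kotlinLambda, javaLambda));
    }
}
```

With these inputs the gain comes out at about 33%, which is consistent with the "around 30%" quoted above.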

Well, in any case, the solution offered by the author to make the Kotlin lambda less costly (even though it doesn't seem like you should worry about that) is to inline the function, which can be done by simply declaring the transaction function with the inline keyword.
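Conceptually, inlining pastes the body of the function and of the lambda into the call site, so no function object is involved at all. The Java sketch below illustrates this by hand, assuming a tiny FakeDatabase class that is purely hypothetical and only mimics the calls used in the example.

```java
import java.util.function.ToIntFunction;

final class InlineSketch {

    // Hypothetical minimal stand-in for the Database type used in the article.
    static final class FakeDatabase {
        int rows = 3;
        boolean inTransaction;

        void beginTransaction() { inTransaction = true; }
        void setTransactionSuccessful() { /* nothing to record in this sketch */ }
        void endTransaction() { inTransaction = false; }

        int deleteAll() {
            int deleted = rows;
            rows = 0;
            return deleted;
        }
    }

    // Not inlined: the body arrives as a function object and is called indirectly.
    static int transaction(FakeDatabase db, ToIntFunction<FakeDatabase> body) {
        db.beginTransaction();
        try {
            int result = body.applyAsInt(db);
            db.setTransactionSuccessful();
            return result;
        } finally {
            db.endTransaction();
        }
    }

    static int deleteWithLambda(FakeDatabase db) {
        return transaction(db, FakeDatabase::deleteAll);
    }

    // Hand-inlined equivalent: the lambda body replaces the indirect call.
    static int deleteInlined(FakeDatabase db) {
        db.beginTransaction();
        try {
            int result = db.deleteAll();
            db.setTransactionSuccessful();
            return result;
        } finally {
            db.endTransaction();
        }
    }
}
```

Both variants do the same work; the inlined one simply avoids the function object and the indirect call, at the price of duplicating the body at every call site.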

When that's done, we get the following result:

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part1.KotlinBenchmarkPart1.kotlinInlinedFunction thrpt 200 1344885.445 2632.587 ops/ms

As you can see, using an inline function in this example did not significantly improve the performance of the Kotlin lambda at all. If anything, it got just a little bit worse.

I have absolutely no idea why these results are the complete opposite of what the author and, admittedly, I expected. Looking at the benchmark code, I can't see anything obviously wrong, so I am cautiously confident that these figures are real.

UPDATE: See this discussion for possible reasons for the surprise results.

Companion Objects

As the author shows, companion objects seem to imply an overhead due to the generated synthetic getters and setters for access to class properties. In the worst case scenario, the first getter might need to even call a second getter, an instance method of the companion object, in order to get a simple constant value.
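The accessor chain described above can be pictured in plain Java roughly as follows. This is a hand-written approximation: the accessor names are hypothetical, and the real compiler output may differ in names and details.

```java
final class CompanionSketch {

    // The private constant ends up as a private static field of the outer class.
    private static final String TAG = "TAG";

    // The single companion instance held by the outer class.
    static final Companion COMPANION = new Companion();

    private CompanionSketch() {}

    // Accessor 1 (hypothetical name): lets the companion read the outer class's private field.
    static String accessTagField() {
        return TAG;
    }

    static final class Companion {
        private Companion() {}

        // Instance getter on the companion, which forwards to accessor 1.
        private String getTag() {
            return CompanionSketch.accessTagField();
        }

        // Accessor 2 (hypothetical name): lets the outer class call the private getter.
        static String accessGetTag(Companion companion) {
            return companion.getTag();
        }

        CompanionSketch newInstance() {
            return new CompanionSketch();
        }
    }

    // Reading the constant goes through accessor 2, the getter, and accessor 1.
    String helloWorld() {
        return Companion.accessGetTag(COMPANION);
    }
}
```

Three method calls stand between helloWorld() and the field, which is the worst case the benchmark below tries to measure.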

I decided to combine the companion object examples shown in the blog post to try to measure the cost of the worst case scenario using the following Kotlin class:

class MyClass private constructor() {

companion object {

private val TAG = "TAG"

fun newInstance() = MyClass()

}

fun helloWorld() = TAG

}

Here's the benchmarked Kotlin function:

fun runCompanionObjectCallToPrivateConstructor(): String {

val myClass = MyClass.newInstance()

return myClass.helloWorld()

}

The cost of the above Kotlin code is being compared against the equivalent, straightforward Java implementation using just a simple static final String within the class itself:

class MyJavaClass {
    private static final String TAG = "TAG";

    private MyJavaClass() {
    }

    public static String helloWorld() {
        return TAG;
    }

    public static MyJavaClass newInstance() {
        return new MyJavaClass();
    }
}

The Java method used was this one:

public static String runPrivateConstructorFromStaticMethod() {

MyJavaClass myJavaClass = newInstance();

return myJavaClass.helloWorld();

}

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part1.KotlinBenchmarkPart1.javaPrivateConstructorCallFromStaticMethod thrpt 200 398709.154 800.190 ops/ms

c.a.k.part1.KotlinBenchmarkPart1.kotlinPrivateConstructorCallFromCompanionObject thrpt 200 404746.375 621.591 ops/ms

Again, Kotlin seems to have a better performance than Java, if only by a tiny margin this time.

Part 2

Local Functions

In this part of the blog post, the author theorizes about the hidden costs of Kotlin's local functions. The only cost, it seems, is the creation of a Function object for capturing functions, but not for functions that do not capture anything from their context.

To test that, we start with a Java local function, or lambda, that avoids boxing but captures one variable from the context:

public static int someMath( int a ) {

IntUnaryOperator sumSquare = ( int b ) -> ( a + b ) * ( a + b );

return sumSquare.applyAsInt( 1 ) + sumSquare.applyAsInt( 2 );

}

The exact same function is implemented in Kotlin as an example given in the blog post:

fun someMath(a: Int): Int {

fun sumSquare(b: Int) = (a + b) * (a + b)

return sumSquare(1) + sumSquare(2)

}

A second Kotlin version which avoids capturing anything from its context is also tried:

fun someMath2(a: Int): Int {

fun sumSquare(a: Int, b: Int) = (a + b) * (a + b)

return sumSquare(a, 1) + sumSquare(a, 2)

}

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part2.KotlinBenchmarkPart2.javaLocalFunction thrpt 200 897015.956 1951.104 ops/ms

c.a.k.part2.KotlinBenchmarkPart2.kotlinLocalFunctionCapturingLocalVariable thrpt 200 909087.356 1690.368 ops/ms

c.a.k.part2.KotlinBenchmarkPart2.kotlinLocalFunctionWithoutCapturingLocalVariable thrpt 200 908852.870 1822.557 ops/ms

By now, it should be no surprise! Kotlin again seems to beat Java by a tiny margin. And that means, once again, if there's any cost to Kotlin, that cost is negative!

Null safety

The Kotlin compiler adds a null-check on each public function's non-null parameter. But how much does that cost?

Let's see.

Here's the tested Kotlin function:

fun sayHello(who: String, blackHole: BlackHole) = blackHole.consume("Hello $who")

BlackHole is a JMH class that can be used to consume values during benchmarks. It ensures that the compiler cannot simply skip computing the value, which would make the benchmark meaningless.
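To make the null check concrete, here is a Java sketch of what the Kotlin function conceptually does before running its body. The helper below is a hand-written stand-in, and its name, exception type and message are assumptions made for illustration rather than a copy of the Kotlin runtime.

```java
final class NullCheckSketch {

    // Hand-written stand-in for the check inserted at the start of a public function.
    static void checkParameterIsNotNull(Object value, String paramName) {
        if (value == null) {
            throw new IllegalArgumentException(
                    "Parameter specified as non-null is null: " + paramName);
        }
    }

    // The check runs once per call, before any of the function's own work.
    static String sayHello(String who) {
        checkParameterIsNotNull(who, "who");
        return "Hello " + who;
    }
}
```

The extra work per call is a single comparison against null, which gives an idea of how small the overhead being measured here is.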

And the Java baseline:

public static void sayHello( String who, BlackHole blackHole ) {

blackHole.consume( "Hello " + who );

}

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part2.KotlinBenchmarkPart2.javaSayHello thrpt 200 73353.725 155.551 ops/ms

c.a.k.part2.KotlinBenchmarkPart2.kotlinSayHello thrpt 200 75637.556 162.963 ops/ms

Again, the cost of using Kotlin is negative, or in other words, using Kotlin seems to help with performance over using Java, going against our predictions based on differences in the bytecode.

Note: I skipped benchmarking the Nullable Primitive Types Part because that is not, in my opinion, a hidden cost of Kotlin as compared to Java, as in Java nullable primitives would also incur the exact same boxing costs as in Kotlin.

Varargs

The cost of using varargs for method parameters, as the author points out, occurs only when you need to use the spread operator to use an existing array as a method argument, something that is not necessary in Java.

So, to test the overhead, we compare the following Java method call:

public static void runPrintDouble( BlackHole blackHole, int[] values ) {

printDouble( blackHole, values );

}

public static void printDouble( BlackHole blackHole, int... values ) {
    for (int value : values) {
        blackHole.consume( value );
    }
}

With the equivalent Kotlin implementation:

fun runPrintDouble(blackHole: BlackHole, values: IntArray) {

printDouble(blackHole, *values)

}

fun printDouble(blackHole: BlackHole, vararg values: Int) {
    for (value in values) {
        blackHole.consume(value)
    }
}

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part2.KotlinBenchmarkPart2.javaIntVarargs thrpt 200 173265.270 260.837 ops/ms

c.a.k.part2.KotlinBenchmarkPart2.kotlinIntVarargs thrpt 200 83621.509 990.854 ops/ms

Finally, a Kotlin hidden cost which you should definitely avoid! Using Kotlin's spread operator, which causes a full copy of the array to be created before calling a method, has a very high performance penalty (and that might increase with the size of the array). In our case, the Java version ran about twice as fast as the seemingly equivalent Kotlin version (roughly 107% faster).
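The difference between the two call styles can be sketched in Java. The method names below are hypothetical; passLikeSpread only imitates the copy that the spread operator is described as making, it is not the compiler's actual output.

```java
import java.util.Arrays;

final class SpreadCopySketch {

    // Returns the array it receives so callers can check whether it was copied.
    static int[] receive(int... values) {
        return values;
    }

    // Java: an existing array is handed to the varargs parameter as-is, without a copy.
    static int[] passDirectly(int[] values) {
        return receive(values);
    }

    // Imitation of the spread operator: the array is copied before the call.
    static int[] passLikeSpread(int[] values) {
        return receive(Arrays.copyOf(values, values.length));
    }
}
```

passDirectly returns the very same array object it was given, while passLikeSpread returns a different array with equal contents, so every call pays for an allocation and a copy proportional to the array length.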

Note: The Passing a Mix of Arrays and Arguments Part is also skipped because there's just no equivalent in Java to compare with.

Part 3

Delegated Properties

To measure the actual cost of using delegated properties in Kotlin, I decided to use the most efficient possible equivalent implementation in Java as a baseline. This may not be totally fair to Kotlin as the two things are not the same, and delegated properties enable using patterns that are just not possible in Java.

However, with this in mind, I think it is useful to know just what kind of cost this really entails, even if compared to a Java version that is specifically written for this one case.

The Java baseline uses the following, trivial classes:

class DelegatePropertyTest {
    public static String stringValue = "hello";

    public static String someOperation() {
        return stringValue;
    }
}

class Example2 {
    public String p;

    public void initialize() {
        p = DelegatePropertyTest.someOperation();
    }
}

As you can see, the caller must remember to call initialize on Example2 in order to initialize the p property!

public static void runStringDelegateExample( BlackHole blackHole ) {

Example2 example2 = new Example2();

example2.initialize();

blackHole.consume( example2.p );

}

The Kotlin code uses a delegate class for initializing the p property:

class StringDelegate {
    private var cache: String? = null

    operator fun getValue(thisRef: Any?, property: KProperty<*>): String {
        var result = cache
        if (result == null) {
            result = someOperation()
            cache = result
        }
        return result!!
    }

    operator fun setValue(thisRef: Any?, property: KProperty<*>, value: String) {
        cache = value
    }
}

class Example {

var p: String by StringDelegate()

}

And the Kotlin test function does roughly the same thing as the Java test function, except it doesn't need to explicitly initialize the Example class property:

fun runStringDelegateExample(blackHole: BlackHole) {

val example = Example()

blackHole.consume(example.p)

}

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part3.KotlinBenchmarkPart3.javaSimplyInitializedProperty thrpt 200 274394.088 554.171 ops/ms

c.a.k.part3.KotlinBenchmarkPart3.kotlinDelegateProperty thrpt 200 255899.824 910.112 ops/ms

Here, we see that there's a small cost, around 7%, associated with using Kotlin delegated properties when compared to manual Java property initialization.
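One plausible source of this overhead is that every property access is forwarded to a separate delegate object. The Java sketch below shows that shape by hand; the field name pDelegate and the use of a plain String in place of the property metadata are assumptions made for illustration.

```java
final class DelegateSketch {

    static String someOperation() {
        return "hello";
    }

    // Java rendering of a caching delegate; propertyName stands in for the property metadata.
    static final class StringDelegate {
        private String cache;

        String getValue(Object thisRef, String propertyName) {
            String result = cache;
            if (result == null) {
                result = someOperation();
                cache = result;
            }
            return result;
        }

        void setValue(Object thisRef, String propertyName, String value) {
            cache = value;
        }
    }

    static final class Example {
        // One extra object is allocated per Example instance.
        private final StringDelegate pDelegate = new StringDelegate();

        // Every read and write of p goes through the delegate.
        String getP() {
            return pDelegate.getValue(this, "p");
        }

        void setP(String value) {
            pDelegate.setValue(this, "p", value);
        }
    }
}
```

Compared with the Java baseline, each Example here costs one additional allocation plus an extra method call and a null check on the first read.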

Note: We skip Generic Delegates because, again, the cost they might incur is related to boxing/unboxing primitive types, not the feature itself.

We also skip the Lazy Delegate Part as that's not a hidden cost, just an informational section regarding how to correctly specify lazy delegates synchronization properties.

Ranges (indirect reference)

To find out the cost of using ranges, as Java does not currently have the equivalent concept at all, we compare the performance of the suggested solutions to the posed performance problems in most examples.

To start with, we compare the cost of using a range with at least one indirection, with just using a range directly where it's needed.

The code that uses indirection involves getting the range by calling a getter:

private val myRange get() = 1..10

fun isInOneToTenWithIndirectRange(i: Int) = i in myRange

As opposed to using a range directly:

fun isInOneToTenWithLocalRange(i: Int) = i in 1..10

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part3.KotlinBenchmarkPart3.kotlinIndirectRange thrpt 200 1214464.562 2071.128 ops/ms

c.a.k.part3.KotlinBenchmarkPart3.kotlinLocallyDeclaredRange thrpt 200 1214883.411 1797.921 ops/ms

Even though there seems to be a tiny cost associated with using indirect references to ranges, the difference is well within the mean error of both measurements, so the cost is not significant.
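A rough way to judge whether two measurements differ meaningfully is to check whether their mean ± mean error intervals overlap. The Java sketch below applies that heuristic to the numbers in the table above; it is a simplification written for illustration, not a proper statistical test, and the names are hypothetical.

```java
final class ErrorMarginCheck {

    /** True when the intervals [meanA ± errorA] and [meanB ± errorB] overlap. */
    static boolean intervalsOverlap(double meanA, double errorA, double meanB, double errorB) {
        return Math.abs(meanA - meanB) <= errorA + errorB;
    }

    public static void main(String[] args) {
        // Indirect range vs locally declared range (ops/ms, from the table above).
        boolean overlap = intervalsOverlap(1214464.562, 2071.128, 1214883.411, 1797.921);
        System.out.println("Intervals overlap: " + overlap);
    }
}
```

Here the two means differ by about 419 ops/ms while the combined error is close to 3,900 ops/ms, so the intervals overlap and the heuristic reports no distinguishable difference.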

Ranges (non-primitive types)

Another range cost found by the author is that, when Ranges are used with non-primitive types, a new ClosedRange instance is created even for locally declared ranges, as in this example:

fun isBetweenNamesWithLocalRange(name: String): Boolean {

return name in "Alfred".."Alicia"

}

Hence, the above should be more expensive than this:

private val NAMES = "Alfred".."Alicia"

fun isBetweenNamesWithConstantRange(name: String): Boolean {

return name in NAMES

}

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part3.KotlinBenchmarkPart3.kotlinStringRangeInclusionWithLocalRange thrpt 200 211468.439 483.879 ops/ms

c.a.k.part3.KotlinBenchmarkPart3.kotlinStringRangeInclusionWithConstantRange thrpt 200 218073.886 412.408 ops/ms

It turns out that, yes, it is better to use constant, non-primitive ranges, than locally declared ones, if you need the absolute best performance possible.

The cost of using local ranges instead of a constant range is around 3%, so this one is not something to worry too much about, though.
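A plausible source of that small gap is the range object allocated on every call in the local variant. The Java sketch below imitates both variants with a minimal, hypothetical ClosedRange class; it is not Kotlin's own implementation.

```java
final class StringRangeSketch {

    // Minimal stand-in for a closed range over Comparable values.
    static final class ClosedRange<T extends Comparable<T>> {
        final T start;
        final T endInclusive;

        ClosedRange(T start, T endInclusive) {
            this.start = start;
            this.endInclusive = endInclusive;
        }

        boolean contains(T value) {
            return value.compareTo(start) >= 0 && value.compareTo(endInclusive) <= 0;
        }
    }

    // Local range: a new range object is allocated on every call.
    static boolean isBetweenNamesWithLocalRange(String name) {
        return new ClosedRange<>("Alfred", "Alicia").contains(name);
    }

    // Constant range: the object is allocated once and reused.
    private static final ClosedRange<String> NAMES = new ClosedRange<>("Alfred", "Alicia");

    static boolean isBetweenNamesWithConstantRange(String name) {
        return NAMES.contains(name);
    }
}
```

Both methods give the same answers; the only difference is the per-call allocation, which the JVM can often make very cheap, and that may explain why the measured gap is so small.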

Ranges (iteration)

One more potential issue with ranges arises when we iterate over them.

Iterating over a primitive range should have zero overhead:

fun rangeForEachLoop(blackHole: BlackHole) {
    for (it in 1..10) {
        blackHole.consume(it)
    }
}

However, iterating using the forEach method should have an overhead, according to the blog post:

fun rangeForEachMethod(blackHole: BlackHole) {
    (1..10).forEach {
        blackHole.consume(it)
    }
}

As should iterating over a range created using a step:

fun rangeForEachLoopWithStep1(blackHole: BlackHole) {
    for (it in 1..10 step 1) {
        blackHole.consume(it)
    }
}

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part3.KotlinBenchmarkPart3.kotlinRangeForEachFunction thrpt 200 108382.188 561.632 ops/ms

c.a.k.part3.KotlinBenchmarkPart3.kotlinRangeForEachLoop thrpt 200 331558.172 494.281 ops/ms

c.a.k.part3.KotlinBenchmarkPart3.kotlinRangeForEachLoopWithStep1 thrpt 200 331250.339 545.200 ops/ms

The above graph shows that using the forEach function for a Range, as the author predicted, should absolutely be avoided. A simple for-loop achieves about three times the throughput, so forEach is roughly 3 times slower!
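The gap is easier to understand when the two iteration styles are written out by hand. The Java sketch below assumes that the for-loop becomes a plain counting loop while forEach goes through a range object and a generic iterator; the IntRange class here is a hypothetical stand-in, not Kotlin's own.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class RangeIterationSketch {

    // Hypothetical stand-in for an integer range object that can be iterated.
    static final class IntRange implements Iterable<Integer> {
        final int first;
        final int last;

        IntRange(int first, int last) {
            this.first = first;
            this.last = last;
        }

        @Override
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                private int cursor = first;

                @Override
                public boolean hasNext() {
                    return cursor <= last;
                }

                @Override
                public Integer next() {
                    if (cursor > last) {
                        throw new NoSuchElementException();
                    }
                    return cursor++; // the int is boxed to Integer here
                }
            };
        }
    }

    // Shape of a for-loop over a primitive range: a plain counting loop, no objects.
    static int sumWithLoop() {
        int sum = 0;
        for (int i = 1; i <= 10; i++) {
            sum += i;
        }
        return sum;
    }

    // Shape of forEach on a range: a range object, an iterator, and boxed values.
    static int sumWithForEach() {
        int sum = 0;
        for (int i : new IntRange(1, 10)) {
            sum += i;
        }
        return sum;
    }
}
```

Both methods compute the same sum, but the second one allocates two objects and makes two interface calls per element, which is in line with the slowdown measured above.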

On the other hand, using an explicit step does not seem to impact the performance of a range for-loop, contradicting the advice in the blog post.

Iterations: Collection indices

Finally, let's measure the cost of using indices on custom classes that are not optimised by the compiler.

For this example, we create a mock version of SparseArray:

class SparseArray<out T>(val collection: List<T>) {

fun size() = collection.size

fun valueAt(index: Int) = collection[index]

}

As suggested by the author, we extend it with a custom indices property:

inline val SparseArray<*>.indices: IntRange

get() = 0..size() - 1

Now, we iterate over the indices:

fun printValuesUsingIndices(map: SparseArray<String>, blackHole: BlackHole) {
    for (i in map.indices) {
        blackHole.consume(map.valueAt(i))
    }
}

The better solution, according to the author, is to use lastIndex instead:

inline val SparseArray<*>.lastIndex: Int

get() = size() - 1

fun printValuesUsingLastIndexRange(map: SparseArray<String>, blackHole: BlackHole) {
    for (i in 0..map.lastIndex) {
        blackHole.consume(map.valueAt(i))
    }
}

RESULT

Benchmark Mode Samples Mean Mean error Units

c.a.k.part3.KotlinBenchmarkPart3.kotlinCustomIndicesIteration thrpt 200 79096.631 134.813 ops/ms

c.a.k.part3.KotlinBenchmarkPart3.kotlinIterationUsingLastIndexRange thrpt 200 80811.554 122.462 ops/ms

Even though it might be slightly safer to use a range from 0 to lastIndex to iterate over a custom collection, the impact of using indices is really small: it seems to run only around 2% slower.

Conclusion

So, which features should you use without concern, and which ones should you avoid?

The ones with a green tick mark below are the features that can, according to this benchmark, be used without concern (cost below 5%).

The red ones should be avoided if possible, unless performance is just a secondary concern.

In any case, I hope this analysis has demonstrated how, when it comes to performance, the only thing you can be sure about is that without measuring, you know nothing.

Higher-order functions and Lambda expressions

No evidence was found to suggest that Kotlin lambdas and higher-order functions should be avoided. To the contrary, they seem to run faster than Java 8 lambdas.

Companion Objects

Companion Objects have no significant performance costs that could be measured, so there's no reason to avoid them from a performance perspective.

Local Functions

Local Functions, capturing or not, do not seem to affect the performance of Kotlin code.

Null safety

Kotlin null-safety checks appear to have a negligible performance impact that can be safely ignored.

Varargs + Spread Operator

Kotlin varargs, when used with the spread operator, have a high performance cost due to an extra, unnecessary array copy. Avoid it if performance is a concern.

Delegate Properties

Avoid delegated properties in performance-critical code. Even though the overhead is quite small, around 7%, this may be unacceptable in certain circumstances.

Indirect access to Range

No performance impact was observed from accessing Ranges indirectly.

Ranges (local, non-primitive types)

The cost of using local Ranges of non-primitive types is almost insignificant (measured to be around 3%), so only avoid in the most extreme circumstances where performance is the main concern.

Ranges (forEach function)

Absolutely avoid calling forEach on Ranges. The cost is extremely high: throughput drops to roughly a third of that of a plain for-loop. Hopefully, the Kotlin team will be able to address this problem in time, but for now, it's a bad idea to use it.

Ranges (iteration with explicit step)

Using an explicit step does not seem to affect the speed of an iteration over Ranges.

Use of indices on custom collection

Using the indices property on a custom Collection did not present a very significant cost over using a range to lastIndex. There is still a cost of around 2%, so using lastIndex may still be good advice for performance-critical applications.

All Results

Benchmark Mode Samples Mean Mean error Units

c.a.k.part1.KotlinBenchmarkPart1.empty thrpt 200 3540527.759 23025.839 ops/ms