
Faster command line tools with Kotlin

posted Aug 6, 2017, 5:35 AM by Renato Athaydes   [ updated Aug 6, 2017, 5:37 AM ]
Inspired by Faster command line tools with Haskell 
    which itself was inspired by Faster command line tools with Go
        (which inspired Faster command line tools with Nim)
        which itself was inspired by Faster Command Line Tools in D.

In this article, I will show how Kotlin (when run on the JVM, in this case) can be used to create fast command-line tools that run at similar speed to the fastest native languages.

Even though that is the main goal, I will also go beyond the basics to show how programs like this can be optimised to a high degree with the help of basic JVM tools and some intuition.

To recap, here is the original problem statement from the D blog post:

It’s a common programming task: Take a data file with fields separated by a delimiter (comma, tab, etc), and run a mathematical calculation involving several of the fields. Often these programs are one-time use scripts, other times they have longer shelf life. Speed is of course appreciated when the program is used more than a few times on large files.

The specific exercise we’ll explore starts with files having keys in one field, integer values in another. … Fields are delimited by a TAB, and there may be any number of fields on a line. The file name and field numbers of the key and value are passed as command line arguments.
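To make the task concrete before diving into the optimised versions, here is a tiny, non-optimised sketch of the whole computation over in-memory lines (`maxKeyBySum` is a made-up name, and it uses `maxByOrNull` from the newer stdlib; this is not the post's code):

```kotlin
// Toy version of the task: TAB-delimited lines, sum the value field per key,
// report the key with the largest sum (or null when there are no valid lines).
fun maxKeyBySum(lines: List<String>, keyIndex: Int, valueIndex: Int): Pair<String, Int>? =
    lines.map { it.split('\t') }
        // skip lines that don't have enough fields, as the real versions do
        .filter { maxOf(keyIndex, valueIndex) < it.size }
        .groupBy({ it[keyIndex] }, { it[valueIndex].toInt() })
        .mapValues { (_, values) -> values.sum() }
        .maxByOrNull { it.value }
        ?.toPair()

fun main() {
    // key "a" sums to 4, key "b" to 5, so "b" wins
    println(maxKeyBySum(listOf("a\t1", "b\t5", "a\t3"), 0, 1)) // prints (b, 5)
}
```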



From previous benchmarks I did on Kotlin, I know Kotlin is a pretty fast language... at least it is in the same league as Java and Go, which is a great thing!

Note: when I talk about Kotlin performance, I am talking about running Kotlin with the Oracle JVM, HotSpot. In particular, I've used Java version 8u141, Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode).

Kotlin version 1.1.3-2 compiled for Java 8 (kotlin-stdlib-jre8)

However, the JVM is not known for good performance on small scripts like the one used in this experiment, due to its slow startup time and the fact that it may not even have time to warm up properly (i.e. the JIT compiler does not get a chance to compile efficient machine code, so the JVM ends up mostly interpreting bytecode).

Anyway, let's get right into it and see how Kotlin and the JVM perform!


V1 - Initial Version


The first version of the code looks almost exactly the same as the original D code (allowing for syntax differences):


val delim = '\t'

private fun run(file: File, keyIndex: Int, valueIndex: Int) {
    val maxFieldIndex = maxOf(keyIndex, valueIndex)
    val sumByKey = mutableMapOf<String, Int>()

    file.forEachLine { line ->
        val fields = line.split(delim)

        if (maxFieldIndex < fields.size) {
            sumByKey.merge(fields[keyIndex], fields[valueIndex].toInt(), Int::plus)
        }
    }

    if (sumByKey.isEmpty()) {
        println("No entries")
    } else {
        val (key, maxEntry) = sumByKey.maxBy { it.value }!!
        println("max_key: $key, sum: $maxEntry")
    }
}



The boilerplate of parsing arguments and handling errors is omitted for brevity, but you can see it here.
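It might look roughly like this. This is a hypothetical sketch (`parseArgs` is my name for it, and the checks are guesses); the real boilerplate is in the linked code:

```kotlin
import java.io.File

// Hypothetical sketch of the omitted boilerplate: parse
// <file> <key-index> <value-index>, returning null on any bad input
// so the caller can print a usage message and exit.
fun parseArgs(args: Array<String>): Triple<File, Int, Int>? {
    if (args.size != 3) return null
    val file = File(args[0])
    val keyIndex = args[1].toIntOrNull() ?: return null
    val valueIndex = args[2].toIntOrNull() ?: return null
    if (!file.isFile || keyIndex < 0 || valueIndex < 0) return null
    return Triple(file, keyIndex, valueIndex)
}
```

A caller would then do something like `parseArgs(args) ?: exit with a usage message` before invoking run.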

To run the program, we type:

java -cp "build/libs/*" Max_column_sum_by_keyKt ngrams.tsv 1 2


Here's the result:


max_key: 2006, sum: 22569013
Total time: 3.37 sec


Note: all times reported in this article are self-reported by the Kotlin code, similar to what was done in the Go and Haskell articles.
Using the shell's time command to measure the time adds at most 0.1 seconds to all reported times. 
In any case, comparing the times obtained by different languages using different hardware
should be seen only as a means to get an idea of whether speeds are around the same order of magnitude.
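The post's actual timing harness isn't shown; self-timing like this can be done with the standard library's `measureTimeMillis`, along these lines (a sketch, with the real work elided):

```kotlin
import kotlin.system.measureTimeMillis

fun main() {
    val elapsed = measureTimeMillis {
        // the real work would go here, e.g. run(file, keyIndex, valueIndex);
        // sleeping just makes this sketch measurably non-instant
        Thread.sleep(50)
    }
    println("Total time: ${elapsed / 1000.0} sec")
}
```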


Given the first D version ran in 3.2 seconds (and Go, in 3.49s), Kotlin is not looking too bad!

However, we can definitely improve on this!

Let's see what the poor man's profiler on the JVM, jvisualvm (with the Startup Profiler Plugin), has to say about our program so far:


[screenshot: profiler results 1]


First of all, Kotlin seems to be introducing a lot of runtime null checks! We probably don't need those, so it would be nice to find out where they are being called from!


It is possible to disable null checks at runtime, but I won't do that as the options to do so are currently unofficial.


Using the Visual VM Sampler to inspect the call stack, the cause becomes obvious:


[screenshot: call stack]



V2 - Java String::split


Kotlin's extension function CharSequence::split, which we use to split each line, seems to be guilty. Even though this function is much more powerful than Java's, all we need here is the simpler Java version, so let's replace the line:


val fields = line.split(delim)


With this:


@Suppress("PLATFORM_CLASS_MAPPED_TO_KOTLIN")
val fields = (line as java.lang.String).split(delim)


Because Java's split takes a String rather than a char, the type of delim had to be changed as well. See the full code here.

Result:


max_key: 2006, sum: 22569013
Total time: 2.73 sec


Nice! That's over 0.6 seconds off the first version with a one-line change!

And this dramatically changes the profiling characteristics of our application:


[screenshot: v2 profiling]


The bottleneck now seems to be related to simply iterating over the file's lines.

Before doing what the Go article's author did and iterating over the file's bytes directly rather than over line Strings, we can try a few simpler things first that might help...

V3 - Wrap Integers into IntBox


Let's see if we can avoid Integer boxing by implementing an int wrapper, so that values are not repeatedly boxed and unboxed on their way into and out of the results Map.

Here's the new code with just some simple modifications based on the introduction of the IntBox data class:


data class IntBox(var int: Int = 0) {
    operator fun plus(other: Int) {
        int += other
    }
}

const val delim = "\t"

private fun run(file: File, keyIndex: Int, valueIndex: Int) {
    val maxFieldIndex = maxOf(keyIndex, valueIndex)
    val sumByKey = mutableMapOf<String, IntBox>()

    file.forEachLine { line ->
        @Suppress("PLATFORM_CLASS_MAPPED_TO_KOTLIN")
        val fields = (line as java.lang.String).split(delim)

        if (maxFieldIndex < fields.size) {
            sumByKey.getOrPut(fields[keyIndex]) { IntBox() } + fields[valueIndex].toInt()
        }
    }

    if (sumByKey.isEmpty()) {
        println("No entries")
    } else {
        val (key, maxEntry) = sumByKey.maxBy { it.value.int }!!
        println("max_key: $key, sum: ${maxEntry.int}")
    }
}



As the box allows the code to always perform operations on primitive int's (new values are not put into the Map, the current box's value is just changed), this could run faster, in theory.

The result:


max_key: 2006, sum: 22569013
Total time: 2.52 sec


Hm.. not a huge difference, just a couple of tenths of a second (but faster). Running this a few times, the time does not vary significantly...
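To see the pattern in isolation (IntBox repeated here so the sketch stands alone): the box is created once per key and then mutated in place, so updates never put new boxed Integers into the Map.

```kotlin
// IntBox as defined above; `+` mutates the box instead of producing a value.
data class IntBox(var int: Int = 0) {
    operator fun plus(other: Int) {
        int += other
    }
}

fun main() {
    val sums = HashMap<String, IntBox>()
    // the first getOrPut creates the box; the second reuses and mutates it
    sums.getOrPut("a") { IntBox() } + 1
    sums.getOrPut("a") { IntBox() } + 3
    println(sums["a"]!!.int) // prints 4
}
```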

One thing we can do now to get to the bottom of this is to include java.* classes in the profiler configuration (they are excluded by default) and see exactly where the time is being spent:


[screenshot: profiler data with Java classes]


Unfortunately, it looks like we're now down to the lowest-level Java methods taking up most of the time: copying array ranges (called by the split method), the String constructor itself (there are a lot of Strings being created!), and String::indexOf (also from split)...

V4 - Faster Map


I had thought of trying to use a faster Map implementation, but the profiler's information does not seem to back up the assumption that doing that would help much!

But I am stubborn, and decided to try anyway... first, by explicitly instantiating a java.util.HashMap instead of using Kotlin's mutableMapOf function (which gives a LinkedHashMap).

Then, by using a heavyweight library called fastutil, which contains lots of fast implementations of primitive collections for Java (which we can use freely from Kotlin, of course)... I decided to try Object2IntOpenHashMap. It even has an appropriate method with signature

    int addTo(final K k, final int incr)

which I thought would give us at least a modest boost in speed (notice that this method allowed me to get rid of IntBox).

But, as usual with performance guesses, I was wrong!

Here's the result using HashMap:


max_key: 2006, sum: 22569013
Total time: 2.39 sec


And using Object2IntOpenHashMap:


max_key: 2006, sum: 22569013
Total time: 2.58 sec


Well, it turns out fastutil was not too fast on this problem (the library seems to optimise for large collections rather than raw speed alone)... but at least we'll take the tiny improvement from plain HashMap, which brings the total time down another 100ms from the previous result, to a respectable 2.39 seconds for a non-native, VM-based language (the only one in the articles mentioned so far, apart from the brief mention of Python in the D article). It's also worth mentioning that Kotlin Native is being developed, so Kotlin may not always rely on the JVM for its best performance.

And the code so far remains extremely clean.

But we're still well behind D and Nim, and not quite better than Haskell yet.

V5 - Process bytes, not Strings


To make things even faster, it seems we'll need to bite the bullet and get rid of all those byte-to-String and String::split operations. They are not strictly necessary anyway, as most of the work can be done at the byte level, as shown in the Go article.

But how much less readable and maintainable will the code become?!

Judge it by yourself:


const val delim = '\t'

private fun run(file: File, keyIndex: Int, valueIndex: Int) {
    val maxFieldIndex = maxOf(keyIndex, valueIndex)
    val sumByKey = HashMap<String, IntBox>()

    val inputStream = file.inputStream()

    val buffer = ByteArray(1024 * 1_000)
    var fieldIndex = 0
    val currentKey = StringBuilder(12)
    val currentVal = StringBuilder(12)

    fun startLine() {
        if (currentVal.isNotEmpty()) {
            sumByKey.getOrPut(currentKey.toString()) { IntBox() } + currentVal.toString().toInt()
        }

        fieldIndex = 0
        currentKey.setLength(0)
        currentVal.setLength(0)
    }

    inputStream.use {
        while (true) {
            val bytesCount = inputStream.read(buffer)

            if (bytesCount < 0) {
                break
            }

            (0 until bytesCount).forEach { i ->
                val char = buffer[i].toChar()

                if (fieldIndex <= maxFieldIndex) {
                    when (char) {
                        delim -> {
                            fieldIndex++
                        }
                        '\n' -> {
                            startLine()
                        }
                        else -> {
                            if (fieldIndex == keyIndex) {
                                currentKey.append(char)
                            } else if (fieldIndex == valueIndex) {
                                currentVal.append(char)
                            }
                        }
                    }
                } else if (char == '\n') {
                    startLine()
                }
            }
        }
    }

    if (sumByKey.isEmpty()) {
        println("No entries")
    } else {
        val (key, maxEntry) = sumByKey.maxBy { it.value.int }!!
        println("max_key: $key, sum: ${maxEntry.int}")
    }
}


Result:


max_key: 2006, sum: 22569013
Total time: 1.39 sec


Really good speed up!

V6 - ByteArray conversion straight into Int


But notice that we're still wastefully converting every value into a String before parsing it as an int, something that can be avoided without much effort...

The code doesn't change too much from the last one shown above, so you can just click here to see the diff if you're interested!

Here's just the ByteArray-to-Int function used:


private fun ByteArray.toIntUpTo(maxIndex: Int): Int {
    var result = 0
    var multiplier = 1
    ((maxIndex - 1) downTo 0).forEach { index ->
        val digit = this[index].toInt() - 48
        result += digit * multiplier
        multiplier *= 10
    }
    return result
}
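A quick sanity check of the digit arithmetic (the function is repeated, minus `private`, so the sketch is self-contained): bytes '0'..'9' are ASCII 48..57, so subtracting 48 from each byte yields its digit value.

```kotlin
// Same conversion as above: walk the digits from least to most significant,
// multiplying each by an increasing power of ten.
fun ByteArray.toIntUpTo(maxIndex: Int): Int {
    var result = 0
    var multiplier = 1
    ((maxIndex - 1) downTo 0).forEach { index ->
        val digit = this[index].toInt() - 48
        result += digit * multiplier
        multiplier *= 10
    }
    return result
}

fun main() {
    // only the first 4 bytes are parsed; the trailing TAB is ignored
    println("2047\t".toByteArray().toIntUpTo(4)) // prints 2047
}
```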


And here's the result after doing that:


max_key: 2006, sum: 22569013
Total time: 1.18 sec


This is pretty close to the best D version!

But we still have one card up our sleeves with Kotlin!

V7 - Parallelise


Given that the current fastest implementation we have can be easily parallelised, there's no reason to not try doing that.

All we need is Java's RandomAccessFile, so that we can read different parts of the file concurrently, plus some setup code to break the file up into partitions.

Here's the setup code:


private fun run(file: File, keyIndex: Int, valueIndex: Int) {
    val maxFieldIndex = maxOf(keyIndex, valueIndex)

    val partition1 = RandomAccessFile(file, "r")
    val partition2 = RandomAccessFile(file, "r")

    val fileLength = file.length()
    var secondPartitionStartIndex = fileLength / 2L

    partition2.seek(secondPartitionStartIndex)

    while (true) {
        val byte = partition2.read()
        secondPartitionStartIndex++
        if (byte < 0 || byte.toChar() == '\n') {
            break
        }
    }

    val firstPartitionResult = CompletableFuture.supplyAsync {
        maxOfPartition(partition1, keyIndex, valueIndex, maxFieldIndex, secondPartitionStartIndex.toInt())
    }

    val secondPartitionResult = CompletableFuture.supplyAsync {
        maxOfPartition(partition2, keyIndex, valueIndex, maxFieldIndex, fileLength.toInt())
    }

    val sumByKey = firstPartitionResult.join().toMutableMap()
            .mergeWith(secondPartitionResult.join())

    val maxEntry = sumByKey.maxBy { it.value.int }

    if (maxEntry == null) {
        println("No entries")
    } else {
        val (key, value) = maxEntry
        println("max_key: $key, sum: ${value.int}")
    }
}



Most of the code that was in the run function in the previous examples was moved to a new function called maxOfPartition, which scans a single partition of the file and returns the sumByKey Map for that partition alone. After computing the maps for each partition in parallel, we merge them together to get the final result.
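The mergeWith extension used above isn't shown in the snippet. A minimal sketch of what it could do, assuming the IntBox type from V3 (this is my guess, not necessarily the author's exact implementation):

```kotlin
// A plain box this time; the merge mutates `int` directly.
data class IntBox(var int: Int = 0)

// Hypothetical sketch: fold the other map's sums into this one,
// adding the values of keys that appear in both maps.
fun MutableMap<String, IntBox>.mergeWith(other: Map<String, IntBox>): MutableMap<String, IntBox> {
    other.forEach { (key, box) ->
        getOrPut(key) { IntBox() }.int += box.int
    }
    return this
}

fun main() {
    val left = mutableMapOf("a" to IntBox(2), "b" to IntBox(1))
    val merged = left.mergeWith(mapOf("a" to IntBox(3), "c" to IntBox(7)))
    println(merged.mapValues { it.value.int }) // prints {a=5, b=1, c=7}
}
```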

Why not use Kotlin coroutines?


Kotlin coroutines are still experimental, but they are expected to become one of the biggest and coolest features of the language.

However, when I tried to write this code using coroutines, I ran into a compiler bug which made it impossible to use them, unfortunately.

If and when JetBrains addresses this bug, I might update this blog post with the code using coroutines, but for now, CompletableFuture it is.



Here is the result when reading two separate partitions of the file in parallel:


max_key: 2006, sum: 22569013
Total time: 0.71 sec


Even though I saw some variance between runs, the result above is around the average for this code (slower runs finished in around 0.9 seconds, the fastest in 0.67 seconds, with most runs closer to the latter than the former).

So, even though we had to employ some pretty advanced techniques compared to the simple v1, we were able to bring the time down from over 3 seconds to well under 1 second. And remember, we're processing a 192.4MB file.

So, I hope this shows that writing Kotlin command-line applications that run fast is possible... and yes, you can get speeds similar to Go and D.


Bonus solution: MappedByteBuffer

As a bonus solution, I decided to map the file bytes into a MappedByteBuffer, which makes reading files extremely fast!

I just replaced RandomAccessFile with MappedByteBuffer, making the necessary changes to compile, but otherwise keeping the last version of the code unchanged (i.e. two buffers running in parallel).

This is not a fair solution as it maps the whole file into memory rather than consuming small chunks of it at a time like the previous solutions did (which allows reading files of any size, not just whatever fits into memory).

In any case, if you don't need to worry about files that won't fit into memory, this would be the fastest thing to do.
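The gist of the change, sketched (not the post's exact code; `countLines` just stands in for the real key/value scanning logic): map the file into memory once and read bytes straight off the buffer, with no read() calls.

```kotlin
import java.io.File
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

// Sketch: map a whole file into a MappedByteBuffer and iterate its bytes.
// (The real version maps two halves and scans them in parallel, as in V7.)
fun countLines(file: File): Int {
    FileChannel.open(file.toPath(), StandardOpenOption.READ).use { channel ->
        val buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
        var lines = 0
        while (buffer.hasRemaining()) {
            if (buffer.get() == '\n'.code.toByte()) lines++
        }
        return lines
    }
}
```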

Here's the result:

max_key: 2006, sum: 22569013
Total time: 0.58 sec


And that's our absolute best time!



Please leave your comments on Reddit.


Summary of results:

V1 - 3.37s

V2 - 2.73s (java.lang.String::split)

V3 - 2.52s (IntBox)

V4 - 2.39s (java.util.HashMap)

V5 - 1.39s (process bytes, not Strings)

V6 - 1.18s (ByteArray -> Int)

V7 - 0.71s (parallelise)

(Bonus version) - 0.58s (MappedByteBuffer)


The (gzipped) file used as an input can be downloaded from here:


Repository containing all the code shown in this blog post:





