Strange String issues in JVM languages

Post date: Nov 27, 2013 8:30:27 PM

I always trusted that I could safely use Strings in Java. After all, String is one of Java's core classes and probably THE most used class.

That's why I was quite surprised by a few blatant String failures demonstrated in this blog post (on the Musing Mortoray blog), which claims, and actually shows, that the String type is broken. And not only in the JVM world, by the way!

I will not repeat what that post says here; please read it. I just want to show the results I got with my favourite languages.

I used the following test Strings:

"😸😾", "noël", "baffle", "abc"

The first one should look like a happy cat followed by a sad cat (😸😾). Each String goes through five functions: length, reverse, tail (the substring after the first character), 3 chars (the first 3 characters) and upper. I added the last String, 'abc', just to make sure that my own implementation of each function was not obviously wrong, so I did not count it in the final score. That leaves 3 Strings and 5 functions, for a maximum score of 15.

Here are the results:

Java 7

Score: 12/15

public static void main(String[] args) {

     for (String s : new String[] { "😸😾", "noël", "baffle", "abc" }) {

         print("***     " + s);

         print("length  " + s.length());

         print("reverse " + new StringBuilder(s).reverse().toString());

         print("tail    " + s.substring(1, s.length()));

         print("3 chars " + s.substring(0, Math.min(3, s.length())));

         print("upper   " + s.toUpperCase());

     }

}

private static void print(String s) {

     System.out.println(s);

}

Result:

***     😸😾

length  4

reverse 😾😸

tail    ?😾

3 chars 😸?

upper   😸😾

***     noël

length  4

reverse lëon

tail    oël

3 chars noë

upper   NOËL

***     baffle

length  6

reverse elffab

tail    affle

3 chars baf

upper   BAFFLE

***     abc

length  3

reverse cba

tail    bc

3 chars abc

upper   ABC
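The three Java failures above (length, tail and 3 chars on the cats) all come from the same place: each cat face is a single Unicode code point that Java stores as two UTF-16 chars, and length() and substring() count chars. The following is only a minimal sketch of code-point-aware versions of those three functions, not what the test above ran. The class name CodePointStrings and its method names are made up for illustration; codePointCount and offsetByCodePoints are standard String methods.

```java
public class CodePointStrings {

    // Number of Unicode code points, not UTF-16 chars.
    public static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    // Everything after the first code point.
    public static String tail(String s) {
        if (s.isEmpty()) {
            return s;
        }
        // Index of the char where the second code point starts.
        return s.substring(s.offsetByCodePoints(0, 1));
    }

    // The first n code points, or the whole String if it is shorter.
    public static String first(String s, int n) {
        int count = Math.min(n, length(s));
        return s.substring(0, s.offsetByCodePoints(0, count));
    }

    public static void main(String[] args) {
        // The happy cat followed by the sad cat, written as UTF-16 escapes.
        String cats = "\uD83D\uDE38\uD83D\uDE3E";
        System.out.println("length  " + length(cats));
        System.out.println("tail    " + tail(cats));
        System.out.println("3 chars " + first(cats, 3));
    }
}
```

With these helpers the cats have length 2, the tail is the sad cat alone, and the "3 chars" result is both cats. Note that code points are still not the same thing as what a reader perceives as one character: a letter followed by a combining accent is two code points, so this sketch only addresses the surrogate-pair problem.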

Scala 2.9.2

Score: 11/15

import scala.math._

for (s <- Array( "😸😾", "noël", "baffle", "abc" )) {

      println("***     " + s)

      println("length  " + s.length)

      println("reverse " + s.reverse)

      println("tail    " + s.tail)

      println("3 chars " + s.substring(0, min(3, s.length)))

      println("upper   " + s.toUpperCase)

}

Result:

***     😸😾

length  4

reverse ?😸?

tail    ?😾

3 chars 😸?

upper   😸😾

***     noël

length  4

reverse lëon

tail    oël

3 chars noë

upper   NOËL

***     baffle

length  6

reverse elffab

tail    affle

3 chars baf

upper   BAFFLE

***     abc

length  3

reverse cba

tail    bc

3 chars abc

upper   ABC

Groovy 2.2.1 (originally tested with 1.8.6)

Score: 12/15 (9/15 with Groovy 1.8.6)

for (s in [ "😸😾", "noël", "baffle", "abc" ]) {

        println("***     $s");

        println("length  ${s.size()}")

        println("reverse ${s.reverse()}")

        println("tail    ${s[1..-1]}")

        println("3 chars ${s[0..<Math.min(3, s.size())]}")

        println("upper   ${s.toUpperCase()}")

}

Result:

***     😸😾

length  4

reverse 😾😸

tail    ?😾

3 chars 😸?

upper   😸😾

***     noël

length  4

reverse lëon

tail    oël

3 chars noë

upper   NOËL

***     baffle

length  6

reverse elffab

tail    affle

3 chars baf

upper   BAFFLE

***     abc

length  3

reverse cba

tail    bc

3 chars abc

upper   ABC

Ceylon 1.0.0

Score: 14/15

shared void run() {

    for (s in [ "😸😾", "noël", "baffle", "abc" ]) {

        print("***     ``s``");

        print("length  ``s.size``");

        print("reverse ``s.reversed``");

        print("tail    ``s.rest``");

        print("3 chars ``s[0..(min({2, s.size - 1}))]``");

        print("upper   ``s.uppercased``");   

    } 

}

Result:

***     😸😾

length  2

reverse 😾😸

tail    ?😾

3 chars 😸😾

upper   😸😾

***     noël

length  4

reverse lëon

tail    oël

3 chars noë

upper   NOËL

***     baffle

length  6

reverse elffab

tail    affle

3 chars baf

upper   BAFFLE

***     abc

length  3

reverse cba

tail    bc

3 chars abc

upper   ABC

Conclusion

I was a little surprised that Java did not win. It has been around for so long, and has been used as a back end for websites all over the world for so long, that I expected it to be completely encoding-error proof by now.

But considering that all of Java's errors happened when handling a rather uncommon set of characters (cat faces?!), I believe this is not so worrying.

The pleasant surprise was Ceylon, which had only one failure, making it our sole winner as the best JVM language for handling Strings! Not bad for a newcomer still on version 1.0.0**.

Scala, for some reason, not only got the "cat tail" wrong, like Java and Ceylon (all languages failed this test!), but also failed to reverse the cats correctly, to get the first 3 (or 2, if there were only 2) chars right, and to know that there were 2 cats, not 4. But at least all Scala errors were due to the mischievous cats!
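A plausible explanation for the reversed cats (an assumption, not something verified against the Scala 2.9.2 sources) is that Scala's reverse treats the String as a plain sequence of chars, while Java's StringBuilder.reverse() is documented to treat surrogate pairs as single characters. The sketch below reproduces both behaviours in Java; the class name ReverseDemo and its method names are hypothetical.

```java
public class ReverseDemo {

    // Naive reverse: swaps individual UTF-16 chars, which splits surrogate pairs.
    public static String naiveReverse(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0, j = chars.length - 1; i < j; i++, j--) {
            char tmp = chars[i];
            chars[i] = chars[j];
            chars[j] = tmp;
        }
        return new String(chars);
    }

    // StringBuilder.reverse() keeps each surrogate pair together.
    public static String pairAwareReverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        // The happy cat followed by the sad cat, written as UTF-16 escapes.
        String cats = "\uD83D\uDE38\uD83D\uDE3E";

        String naive = naiveReverse(cats);
        // The naive result starts with a lone low surrogate and ends with
        // a lone high surrogate, which most terminals print as '?'.
        System.out.println(Character.isLowSurrogate(naive.charAt(0)));
        System.out.println(Character.isHighSurrogate(naive.charAt(naive.length() - 1)));

        System.out.println("reverse " + pairAwareReverse(cats));
    }
}
```

The naive version leaves one intact pair in the middle (the happy cat) surrounded by two orphaned halves of the sad cat, which matches the shape of the "?😸?" output in the Scala results.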

The disappointment in the list, however, was Groovy (I am sad to say this, as I have a lot of fun writing Groovy code and would like to see it perform better). It got a worrisome 6 failures! It even managed to get the length of a simple, actual English word, baffle, completely wrong. This could cause horrible bugs in someone's application. It also failed to reverse baffle. But I must add here that I used an outdated version of Groovy (Groovy is now on 2.2.0; I used 1.8.6). Groovy advocates can blame whoever is behind apt-get, which currently provides the version I used. I will try to update the results using Groovy 2.2 later, and I hope these figures will improve!

But there's an important lesson here!

Not all JVM languages handle Strings (and several other things, especially numbers) in the same way.

** After reading this post, the Ceylon team told me they had fixed this issue and can now pass all 15 tests!