The Numbers Mini-Language 


Stevey's Drunken Blog Rants™

This month I was planning on writing an article on the importance of mini-languages. I was hoping to give you a guided tour of a bunch of mini-languages you use all the time, and my conclusion was going to be that writing your own mini-languages can sometimes save you a lot of code and a lot of mistakes.

However, the first mini-language I chose to talk about — numbers — was so rich that it turned into an entire article. Numbers don't get much attention in most programming books, and yet they're surprisingly complex. (No pun intended.) So I'll give a little exposition on some dialects of the numbers mini-language, and hopefully take a few potshots at Perl and C++ while I'm at it.

Scientific Notation

How would you represent Avogadro's Number in a programming language?

You may recall from chemistry that Avogadro's Number (or NA) is the number of atoms or molecules in one mole of a substance: the amount whose mass in grams equals the substance's atomic (or molecular) mass. Its value is approximately 6.022 x 10^23. Written out, that's 602,200,000,000,000,000,000,000. Roughly.

Unfortunately, neither standard nor scientific mathematical notation works in most programming languages. Standard U.S. comma-separated notation isn't used because the comma already has another meaning, such as a statement or list-item separator. For instance, in C/C++/Java, you can say:

int foo[] = {123,456,789};

If they allowed commas in numbers, the declaration above would be ambiguous. Similarly, you can't use true scientific notation, since you can't specify a superscripted number in the source code for most languages. Plus the "x" in "6.02 x 10^23" is ambiguous, since it could be a variable name. The parser could figure it out, but numbers are almost always processed by the lexer, which doesn't have enough information to know it's not an identifier.

In short, languages don't allow scientific notation because language designers are either too dumb or too lazy to make it work, not because it's impossible.

Instead, most languages provide syntax that tries to approximate scientific notation. You can, for instance, say

double NA = 6.02e23;

which is taken by the compiler to mean you're assigning 6.02 x 10^23 to the variable NA. Case is ignored for the letter "e", and a "+" sign on the exponent (or the mantissa) is optional, so these are all equivalent:

double NA = 6.02e23;
double NA = 6.02E23;
double NA = 6.02E+23;
double NA = +6.02E+23;

You can't put a space after the 'e', or the compiler will become confused. That's strange, because there's no other possible interpretation of the token "6.02e" followed by the token "23". But parsing numbers is quite a bit of work already, so most language designers take shortcuts to make it tractable. I guess numbers just aren't glamorous enough to get designers to exert themselves on our behalf.
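For what it's worth, you can check the equivalences yourself. Here's a quick sketch in Python, whose float literals follow the same C-style e-notation rules as Java, Ruby, and Perl:

```python
# The lexer consumes the whole token "6.02e23" as one number; the case
# of the 'e' and a leading '+' on the exponent (or the number) are
# irrelevant.
a = 6.02e23
b = 6.02E23
c = 6.02E+23
d = +6.02E+23

assert a == b == c == d
# But "6.02e 23" would be a syntax error: the exponent has to be glued
# onto the token, for exactly the lexer-related reasons above.
```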

Grouping Separators

Some languages allow you to use underscores as a grouping separator, to make numbers more readable. For example, Perl lets you say:

my $x = 602_200_000_000_000_000_000_000;

That leaves some open questions, though. Do you have to put the underscores exactly where commas would be legal in standard notation? Or can you insert them willy-nilly? Can you use several in a row? Can a number begin or end with an underscore?

In Perl, all underscores "evaporate" from numbers, so the following numeric representations are all equivalent to 2.05:

2.05
2_.__0____5____
2_____.____0___5__________

As long as it starts with a numeric digit, Perl starts discarding underscores. Well, except if it's scientific notation. Perl typically has at least one "except" clause for every rule. Helps lock you in.

Other languages are less forgiving -- that is, if "forgiving" is the right word for a language that lets you obfuscate something as fundamental as a number. In Ruby, for instance, you can "only" use one underscore in a row, and numbers can't end in an underscore.

Ruby does let you use underscores in the exponents of scientific notation, so this is the same as 2 x 10^10:

2e1_0

This probably isn't very useful, but at least it's consistent.
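Modern Python (3.6 and later, via PEP 515) ended up adopting rules close to Ruby's; a sketch:

```python
# A single underscore is allowed only *between* digits -- never leading,
# trailing, or doubled. Perl's evaporating-underscore free-for-all is out.
x = 602_200_000_000_000_000_000_000   # fine
assert x == 602200000000000000000000

# Underscores work in exponents too, as in Ruby:
assert 2e1_0 == 2e10 == 20_000_000_000.0

# These would all be SyntaxErrors:  2_.05   2__05   205_
```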

Java, C and C++ do not have a number-grouping separator, so you'll often see constants declared like this:

double ONE_MILLION = 1000 * 1000;

Virtually all compilers are smart enough to figure out that constant expressions can be computed at compile-time. But it's not very convenient, especially if you want to represent a number that's not a power of ten. You could certainly say:

double x = 123 * 456 * 789;

but it's probably not what you meant. To represent the number 123,456,789 in C/C++/Java, you'd likely say one of the following:

double x = 123456789;
double x = (123 * 1000 * 1000) + (456 * 1000) + 789;

Grouping separators were a good idea.

Radixes

Unlike people, computers can only count on one finger, so they express numbers internally in base-2 (binary). Some programming languages let you specify numbers in binary form. Ruby, for instance, lets you say things like

0b11110000
=> 240

and

0b111_000_111
=> 455

but it doesn't let you specify binary numbers in scientific notation. You can't say:

0b1111_0000e11
SyntaxError: compile error

Which is all well and good, because if you did that, people would think you were pretty weird.

C, C++ and Java don't support binary literals. You have to specify them in octal or hexadecimal. It's possible to convert them to binary in your head, if you can count to 7 in octal and F in hexadecimal. But it's error-prone, so to assist you'll often see comments like this in C++ and Java code that manipulates bit fields:

// if ((i & 1 1111 1111 1111b) == 0)
if ((i & 0x1FFF) == 0) ...
// if ((i & 1010101010101b) == 0)
if ((i & 0x1555) == 0) ...
// if ((i & 1001001001001b) == 0)
if ((i & 0x1249) == 0) ...
// if ((i & 1000100010001b) == 0)
if ((i & 0x1111) == 0) ...
// if ((i & 0010000100001b) == 0)
if ((i & 0x421) == 0) ...

(Note the parentheses around each & expression: == binds tighter than & in C, so without them every one of these tests would quietly compute i & 0.)

This is a perfect example of where a mini-language (or more precisely, a mini-mini-language) would have made things slightly easier in some situations. Not having binary literals is just dumb.
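Languages that later grew binary literals prove the point. In modern Python (Ruby's look the same), the comments above become the code itself, and grouping underscores line up with the hex digits:

```python
# Each binary literal is the "comment" from the C code, now checkable:
assert 0b1_1111_1111_1111 == 0x1FFF
assert 0b1_0101_0101_0101 == 0x1555
assert 0b1_0010_0100_1001 == 0x1249
assert 0b1_0001_0001_0001 == 0x1111
assert 0b100_0010_0001    == 0x421

# And the masking test reads the way you think about it:
i = 0x2000
if (i & 0b1_1111_1111_1111) == 0:
    print("low 13 bits clear")
```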

Most languages provide syntax for specifying numbers in octal (base-8) or hexadecimal (base-16). In most languages, octal numbers start with 0, and hex numbers start with 0x, copying the style of the C language.

Emacs Calc allows you to specify numbers in any radix from 2 to 36, using the syntax: radix#number. Example: it tells me that 17#CAFEBABE ("cafebabe" in base-17) is 5,187,964,049 in base-10. I'm, uh, sure it's right.
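It is right, as it happens. Python's int() parses strings in any radix from 2 to 36, so you can check Calc's arithmetic without trusting either my head or Emacs:

```python
# "CAFEBABE" interpreted in base 17, like Emacs Calc's 17#CAFEBABE:
assert int("CAFEBABE", 17) == 5_187_964_049

# Base 36 uses all ten digits plus all twenty-six letters:
assert int("zz", 36) == 35 * 36 + 35 == 1295
```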

Generally, people prefer to use plain old base-10 numbers unless they're interpreting the numbers as bit fields. As a result, base-10 numbers get most of the mini-language love.

Complex Numbers

Some languages have built-in support for complex numbers. Complex numbers, as you may recall, are vectors in the complex plane, and are a superset of the real numbers. Complex numbers are usually expressed in mathematics texts in the form x + iy, where i is the imaginary unit equal to the square root of -1.

Not all languages support complex numbers, and many of those that do provide them only as a library.

Java has no support for complex numbers, even in the latest release, JDK 5.0. So java.math is still an important-looking top-level package with almost nothing in it.

C++ supports complex numbers with the std::complex class template. As with everything else in C++, it's complicated to use, and is filled with unintuitive behavior and traps for the unwary. This will come as no surprise, I should think.

Perl and Ruby support complex numbers through libraries that come with the standard language distribution: Perl's Math::Complex, and Ruby's complex and mathn (Unified Math) modules.

Python, Common Lisp and Scheme have built-in language-level support for complex numbers. They all provide built-in functions for extracting the real and imaginary parts, and for treating the numbers as polar or rectangular coordinates in the complex plane.
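Python's flavor of this, for instance, looks like so (the polar/rectangular conversions live in the standard cmath module):

```python
import cmath

# Complex literals use a 'j' suffix -- an electrical-engineering habit:
z = 3 + 4j
assert z.real == 3.0 and z.imag == 4.0
assert abs(z) == 5.0                    # magnitude

# Rectangular <-> polar conversions:
r, theta = cmath.polar(z)
assert r == 5.0
assert cmath.isclose(cmath.rect(r, theta), z)
```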

Once you admit complex numbers into your language, you've started down the path to supporting mathematical vectors, which are one-dimensional matrices, and then you need to decide whether to support multi-dimensional matrices and the "standard" matrix operations. Not many languages have syntactic language support for matrices, so using them becomes an awkward exercise in calling functions on objects.

Precision

In most programming languages, the representation of numbers is closely tied to the capabilities of the hardware. You'll often see a set of limited-precision numeric types (e.g. byte, short, int, long, float, double) that can be mapped onto the word size(s) of Intel hardware. This is to make it convenient for the compiler to generate efficient number-crunching code, by using built-in CPU facilities.

C, C++ and Java all have primitive numeric types that conveniently map to "modern" hardware register sizes of 32 and 64 bits. All three give you access to "bignums" -- arbitrary-precision numbers that never overflow -- but only through cumbersome libraries (java.math.BigInteger for Java, or third-party libraries such as GMP for C and C++). By default, these languages strongly encourage you to use the primitive types for manipulating numbers in your programs.

This design decision is, of course, a premature optimization that comes with disastrous consequences. ints, longs, doubles and their ilk are some of the most evil, bug-ridden constructs in "modern" programming languages. We tend to take this evil for granted, similar to how we take it for granted that people get killed every day by drunk drivers. Fatal accidents are just a fact of life, and you need to look both ways before you cross the street. Such is the life of a C++ or Java programmer.

Compilers can do only very limited checking to ensure your numeric computations aren't producing an overflow. The following Java code produces a compiler error, saying the number is too large:

int y = 1219326311126352690;

However, the Java compiler will accept this quite happily, even though it's the same value, one that can be computed at compile-time:

int y = 1234567890 * 987654321;

The Gnu C compiler (gcc) is smart enough to issue a warning for the expression above: "warning: integer overflow in expression". (Jeez, shouldn't that be an error?) But even so, one more modification is enough to fool it:

int x = 987654321;
int y = x * 1234567890;

This code compiles and runs without a whiff of complaint. Although the compiler could try to detect this case, it is not in general possible to detect at compile-time when an overflow will occur, so most compilers don't try to detect overflows that span multiple expressions. Doing so could lull programmers into an even falser sense of security than they've already got.
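The wraparound itself is easy to reproduce. Here's a sketch in Python, whose integers never overflow, so the 32-bit truncation has to be simulated with a hypothetical helper:

```python
def to_int32(n):
    """Interpret n modulo 2**32 as a signed 32-bit value -- i.e., what a
    Java int or a typical C int ends up holding. (Illustrative helper,
    not any standard library function.)"""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n

product = 1234567890 * 987654321
assert product == 1219326311126352690       # the true answer
assert to_int32(product) == -671530190      # what the Java code quietly stores
```

Note that the "answer" isn't just wrong, it's negative: the overflow flipped the sign bit.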

Encouraging Disaster

Integer overflow is easy to produce, and it results in incorrect program behavior. Some kinds of incorrect behavior are worse than others. If the program simply draws something incorrectly, it may be acceptable. If the overflow crashes the program, it may be harder to swallow. If your website or service goes down, you can lose money and customers.

But crashing isn't the worst that can happen, not by far. In fact, most of the time, crashing (or at least throwing an exception) would be the best thing that could happen in the event of an integer overflow. Crashing is probably preferable to, say, performing a financial transaction incorrectly. And even that isn't the worst that can happen, since usually you can manually recover from an incorrect financial transaction, as soon as someone notices it.

The worst thing that can happen as a result of integer overflow is a security vulnerability that gives an attacker control over your machine(s). That could trigger a wide variety of numerically valid financial transactions, all of which you'd probably rather avoid.

The problem is exacerbated by the machine-dependence of the data types. "int" may mean different things on different hardware, so even if your code is thoroughly tested and debugged on one platform, porting it may expose the problems all over again.

Sadly, C and C++ are still two of the most commonly-used languages on the planet, and both of them make these kinds of disasters commonplace.

Earlier I claimed that having fixed-precision numeric types in your language is a premature optimization. It is. Most numeric processing is non-performance-critical, at least outside the realm of scientific and engineering applications, and game programming.

Supporting Bignums in a library doesn't help matters, because most programmers won't use them. Programmers will do whatever the language makes it easiest to do. They'll type "int" instead of "long" because it's 25% fewer characters to type, and because they've heard dark hints that long updates aren't atomic operations on some processors. Plus, int seems plenty big enough for most purposes.

In fact, chances are very good that you've never used a BigNum class library directly. Most programmers haven't. If you've ever run into a situation where int isn't big enough, then you probably just switched it to a long and went on your merry way.

Even looking at the java.math.BigInteger documentation makes me want to avoid it. Look at the API documentation for the following BigInteger constructor:


BigInteger

public BigInteger(int signum, byte[] magnitude)
Translates the sign-magnitude representation of a BigInteger into a BigInteger. The sign is represented as an integer signum value: -1 for negative, 0 for zero, or 1 for positive. The magnitude is a byte array in big-endian byte-order: the most significant byte is in the zeroth element. A zero-length magnitude array is permissible, and will result in a BigInteger value of 0, whether signum is -1, 0 or 1.

Parameters:
signum - signum of the number (-1 for negative, 0 for zero, 1 for positive).
magnitude - big-endian binary representation of the magnitude of the number.
Throws:
NumberFormatException - signum is not one of the three legal values (-1, 0, and 1), or signum is 0 and magnitude contains one or more non-zero bytes.

If the language makes it that hard to use something, you're not going to use it unless you're backed into a corner. I can't even imagine trying to calculate anything meaningful using Java's Bignum system. I don't even know how you'd do something simple like the quadratic formula or the distance formula; BigInteger appears to lack a facility for taking square roots.
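Compare a language where bignums are the default. In Python, "take a square root of a huge integer" is one call (math.isqrt, the exact integer square root, added in Python 3.8), and the distance formula needs no ceremony at all:

```python
import math

# Exact integer square root on a number far past 64 bits:
n = 10**40
assert math.isqrt(n) == 10**20

# The distance formula on huge coordinates, no BigInteger ceremony:
def distance(x1, y1, x2, y2):
    return math.isqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

assert distance(0, 0, 3 * 10**20, 4 * 10**20) == 5 * 10**20
```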

A language with easy-to-use fixnums and hard-to-use bignums is encouraging programmers to flirt with disaster.

Java could make this situation far better simply by providing some syntax for Bignums. Imagine being able to say this:

BigInteger x = 123456789 * 987654321;
BigInteger y = x + 1;

Not much going on here; it's certainly possible to set something like this up in C++ using operator overloading. But Java has no mechanism for extending the language syntax. I'm not asking for horrifically ugly and error-prone Templates à la C++. A simple mechanism for operator overloading (such as Ruby's, which is much cleaner than its C++ cousin) would suffice.

Languages that make things unavoidably error-prone just suck.

Arbitrary Precision to the Rescue

Fixed-precision numeric data types are an artificial construct. In the real world, numbers have infinite bounds. Numbers form a hierarchy of types: integers, rationals (ratios of integers), reals (rationals plus an infinity of irrational numbers like e and pi), and complex numbers. In the ideal scenario, a programming language would default to these four "true" numeric types, all with arbitrary precision.

But computers are non-ideal. Even with Bignums, you're limited by the storage available on your computer (not to mention the processing time). So all programming languages have to make some concessions to the realities of computer hardware. For a given language, the question is whether they've made good trade-offs.

Generally, I think a good trade-off is one that makes it easiest to do the right thing, and then makes it possible to optimize for performance after you know it's correct. This policy will still admit bugs, but far fewer of them, because most of the time you won't need to go in and optimize your number-crunching. It'll be something else -- I/O or a quadratic algorithm, most likely -- that causes most of your performance bottlenecks.

Ruby, Python, and Scheme all have numeric systems that favor correctness by default. The following session illustrates Ruby's approach:

irb(main):022:0> a = 100_000_000
=> 100000000
irb(main):023:0> a.class
=> Fixnum
irb(main):024:0> a = a * a
=> 10000000000000000
irb(main):025:0> a.class
=> Bignum
irb(main):026:0> a += 1.0
=> 1.0e+16
irb(main):027:0> a.class
=> Float

In Ruby, numbers are objects. If they're small enough to fit in a machine word (less a tag bit), they're represented as Fixnum objects, which store the value directly in the word. If you perform an operation that overflows, the result is quietly promoted to a Bignum or Float in order to make room. Bignum and Float both have semi-optimized implementations, using only as much precision as they need in order to store the current value.

Ruby encourages correctness by default, which is great. However, it appears that the only way to optimize your number-crunching in Ruby is to rewrite the relevant pieces in C. Both Lisp and Ada provide slightly better solutions.

The Scheme programming language (a dialect of Lisp) has a "numeric tower" of subtypes: integer, rational, real, complex, number. Additionally, Scheme numbers are either exact or inexact. Inexactness is a contagious property of a number; any inexact operation will render the result of the whole expression inexact. Scheme has syntax for specifying numbers in binary, octal, decimal and hexadecimal, and for specifying whether the number begins life as exact or inexact. There's even syntax for specifying the desired precision of inexact numbers (short, single, double, long), although the actual bit-widths of these specifiers are implementation-dependent. But overall Scheme has one of the richest mini-languages for specifying and manipulating numeric values.
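You can get a taste of Scheme's exact/inexact split even in Python, via the standard fractions module. This is only an analogy, not Scheme's numeric tower, but the contagion behaves the same way:

```python
from fractions import Fraction

# Exact rational arithmetic, in the spirit of Scheme's exact numbers:
third = Fraction(1, 3)
assert third + third + third == 1      # exactly 1, no rounding

# Binary floating point can't do that:
assert 0.1 + 0.1 + 0.1 != 0.3

# Inexactness is contagious: mix one float into the expression and the
# whole result decays to an inexact float.
assert isinstance(third + 0.5, float)
```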

In Common Lisp, like in Ruby, numbers and their operations have arbitrary precision by default. Unlike Ruby, though, Common Lisp provides sophisticated mechanisms for going back and optimizing your numeric processing where it's needed, by declaring types and precisions. So Common Lisp offers a very good trade-off: it encourages correctness, but permits optimization.

Ada 95 takes a different approach: there are no "unlimited precision" numbers (as far as I know). The language forces you to specify the precision of any numbers you declare, and it throws errors on overflow and underflow. As a result, it's fast and safe. The downside is that it doesn't give you a convenient "out" when you're doing non-performance-critical code. You always have to declare your precisions, and it forces you to stick with them.

Perl appears to have bignum support by default, as in Ruby, but I'd be hard-pressed to say Perl avoids being error-prone. Hard-pressed.

The important thing to realize about arbitrary-precision arithmetic is that nobody will use it unless your numbers mini-language supports it by default. You can build Object-Oriented interfaces until you're blue in the face, but the Bignum example shows that a real language (even a small one) with real syntax can improve usability and user adoption by orders of magnitude.

Conclusion

At this point, I'm 1____0_0.000__00_____% confident that I didn't bash on Perl enough. Maybe next $time.

I'm also confident that regardless of which language you favored before reading this article, you still favor it now. Even if it's C++. It's hard to re-open a closed mind.

But even if you aren't going to switch to an easier, safer, saner language any time soon, I think you have to admit you're glad your language has a numbers mini-language built into its syntax. Imagine having to use the equivalent of a bignum class every time you wanted to do floating-point, or hexadecimal notation. Ouch. I hope I've at least partially introduced you to the power that domain-specific mini-languages can bring to programming.

Maybe next time I'll get a chance to talk more about them. My eventual goal is to show you that with the right set of custom mini-languages, we could cut our 100-million line Amazon code base by (I'd guess) 80% or more, with no noticeable performance penalty, and far greater reliability. But that's sort of a long-term goal; it could take me years to establish a reasonable proof. Don't hold your breath.

In the meantime, have fun with this month's programming challenge! I recommend you use a language that supports polar coordinates. (*Hint hint*)

p.s.

While I was writing the section on radixes, I noticed that Perl and Ruby both allow you to specify hexadecimal numbers in scientific notation, e.g.:

irb(main):038:0> 0x16e22
=> 93730

but not octal or binary numbers. Go figure.

I spent almost an hour puzzling over various hypotheses about how the hexadecimal-scientific notation was supposed to work. For instance, 0x5e5 should be either 5 x 10^5 (500,000) or 5 x 16^5 (5,242,880), but it was coming out as 1509. Was it doing the shifting in binary? Adding some offset? It wasn't discussed anywhere in the Perl or Ruby documentation. I tried at least 20 combinations, starting at 0x1e0 (mysteriously, 480, or 111100000 in binary). Perl and Ruby were both computing the same results. I was almost ready to chalk it up to an undocumented bug in the number-parsing code.

I was rather embarrassed when I finally figured it out, at 3:30am. Time for bed!

(Published Jan 16th, 2005)