For a few years now, I’ve been meaning to read Peter Norvig’s old book, “Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp”, but every time I started I’d inevitably get lost in the code due to my poor Lisp skills. For some reason I just couldn’t absorb the introductory Lisp chapters at the beginning of the book. So a little while ago I decided to “begin at the beginning” and slowly started making my way through Peter Seibel’s “Practical Common Lisp”.
Inevitably, one of the first things I learned about was the “format” function. Format, is essentially the Lisp equivalent of printf, so you kinda need it to produce the old “Hello World”. Just like printf and its ilk, format supports string interpolation via embedded directives. For example if you wanted to print two strings side by side but separate them by a specific number of spaces you could do something like this:
CL-USER> (format t "~a~40t~a~%" "abc" "123")
abc 123
NIL
One of the most interesting format directives is ~r which will take an integer and produce its English representation e.g.:
CL-USER> (format nil "~r" 123456)
"one hundred twenty-three thousand four hundred fifty-six"
I found this little gem in one of the footnotes, the lesson here being – always read the footnotes. Anyway, I don’t know about you, but I thought that was pretty awesome, which immediately got me thinking about how hard it would be to implement a stand-alone integer to English translation function. I would have loved to try my hand at doing this in Lisp, but I don’t think I am sufficiently Jedi for that as yet (I will however give it a go as soon as I am ready, you will not want to miss that :)), so instead I decided to have a go at implementing it in Ruby.
A Totally Unrelated Musical Trivia Interlude
Speaking of Jedi, I’ve recently been a little obsessed with the “Good Morning” song from “Singing In The Rain”, you know the one I mean. I blame “Family Guy” for this, it all started after I watched the episode where they perform this song in one of their postmodern flashes of randomness. Anyway, it’s just such a nice and happy song, it’s kinda quaint how 1:30 used to be considered “late” in those days and I dig the whole synchronized tap dancing thing that they do (just imagine how much they had to practice to get it right).
Now, here is the trivia, the girl in the song/movie is a 20 year old Debbie Reynolds (did you know that before “Singing In The Rain” she didn’t even know how to dance) who went on to marry singer Eddie Fisher. A few years later she gave birth to a daughter named Carrie (i.e. Carrie Fisher) who, in turn, went on to play Princess Leia in “Star Wars”. That’s how all of this relates to Jedi, and you thought I was going off on a random tangent – now you know better :). Time to get back to some Ruby code.
A Ruby Based Integer To English Converter
It turns out writing it was somewhat harder than I expected because the English language is a little insane when it comes to naming numbers. Everything was going well at the start; it was obvious that we need some kind of lookup table to convert digits to words. The lookup table must contain all the uniquely named numbers that we are likely to encounter.
- all number between 1 and 10 (e.g. one, two, five etc.)
- all number between 11 and 20 (e.g. eleven, twelve, sixteen etc.)
- all multiples of 10 between 20 and 100 (e.g. thirty, forty etc.)
- all uniquely named powers of 2 over 100 (e.g. hundred, thousand, million, billion etc.)
Here is a cut-down version:
|
|
We don’t really think about it, since we are so used to it, but that is actually a lot more unique numbers than strictly necessary. You can easily identify all possible numbers using just 1 through 9 as well as all the named powers of 10, such as ten, hundred, thousand etc (we’ll explore this thought further shortly). Doing it this way would be pretty logical, instead the English speaking world decided that we needed nine special words to represent numbers between 11 and 19 as well as eight more similar, but still unique, words for the multiples of 10 between 20 and 100. As much as we take it for granted this causes all sorts of pain when we try to implement out integer to words converter.
Consider this, we’re given an integer several digits long, let’s say 12.
- to get a word representation we start looking at the digits of the integer in reverse order (we have to start with the least significant digit first):
- digits = 21
- we use the look-up table to find the word that represents the first digit and add it to an accumulator (which is a list that will contain all the words in the English representation of our integer):
- digits = 21, accumulator = [two]
But as soon as we move on to the second digit, we’re already in trouble. If this digit is a one, we know we have a number between 11 and 19, so we have to do the following:
- backtrack the previous digit from the accumulator:
- digits = 21, accumulator = []
- get the current digit along with the previous digit and reverse them to get our actual number:
- digits = 12, accumulator = []
- do a lookup to get the new word representation and put the new representation in the accumulator:
- digits = 12, accumulator = [twelve]
The story is similar if the second digit is 2 through 9, we still have to backtrack, then we need to do a lookup on the second digit multiplied by 10 (i.e. 20, 30 etc), combine the result with the word representation of the previous digit and put that back in the accumulator. So if the number is 20, we need to look up 20, combine that with 5 and get twenty five which ends up in the accumulator.
When the number is big enough, this patters comes up again and again every few digits. For example, numbers such as 11000, 23000, 134000 need to be represented as eleven thousand, twenty three thousand and one hundred and thirty four thousand (see what I mean). Luckily it is a pattern so we can handle all occurrences of it as a special case, but it’s a pattern that need not be. And talking about patterns that need not be, what is our obsession with the word hundred:
- three hundred and fifty
- one hundred and twenty thousand three hundred and fifty
- one hundred and twenty five million three hundred and twenty thousand five hundred and six
We have the word hundred appearing 3 times in that last one, we’re used to it and it feels comfortable, but is it strictly necessary? We have to handle that one as a special case as well. Needless to say, the code ends up being long and ugly, you can have a look at the full source here, but the main chunk of it is something along the following lines:
|
|
You can have a go at running it, but don’t forget, I only took the lookup table up to trillions, so if you want to try numbers bigger than that you will need to add more named powers of ten (such as quadrillion, quintillion etc.) to the table. Here is some sample output:
315119 - three hundred and fifteen thousand one hundred and nineteen 1000001 - one million and one 1315119 - one million three hundred and fifteen thousand one hundred and nineteen 11315119 - eleven million three hundred and fifteen thousand one hundred and nineteen
Using Code To Derive The English That Should Have Been
Even though my code was producing reasonable output, I wasn’t entirely happy, the arbitrary nature of the English language made my code uglier than it could have been. Since I’ve taken liberties with the English language before I decided this was a good time to do it again. What if we were to remove all the extraneous words that English had for numbers. Surely, the numbers 1 through 9 are more than sufficient? This would be the simplest possible case e.g.:
135 - one three five 3619 - three six one nine
Unfortunately it just doesn’t cut it. Firstly we would have to introduce 0 as a non-silent number e.g.:
609 - six zero nine
Currently, zero is completely silent in all numbers unless it is by itself e.g.:
609 - six hundred and nine (see, zero is silent)
But even if this wasn’t an issue, it turns out we actually need the named powers of ten to make English representations of integers meaningful. One five six two three, tells us almost nothing, but fifteen thousand six hundred and twenty three gives us a lot of information instantly. Even when we haven’t processed the number fully, we know straight away that it is approximately 15 thousand and a few hundred. The bigger the number gets, the more pronounced this “estimation” effect is. So, we need numbers one through nine as well as named powers of 10. This allows us to cut our lookup table down to the following:
|
|
We can use this table to write code that is much more intuitive than our previous attempt. Infact, we don’t even need to know exactly what we’re trying to achieve, instead we let the code guide us to the “correct” solution. We’re using code to derive the requirements for the real world which is the opposite of the usual scenario. The full implementation I came up with is in this gist, but here is the main bit:
|
|
You have to admit, it’s much shorter and nicer looking and here is some of the output produced:
0 - zero 1 - one 3 - three 5 - five 11 - one ten one 15 - one ten five 25 - two ten five 71 - seven ten one 40 - four ten 100 - one hundred 101 - one hundred one 112 - one hundred one ten two 123 - one hundred two ten three 457 - four hundred five ten seven 999 - nine hundred nine ten nine 1000 - one thousand 1001 - one thousand one 1010 - one thousand one ten 1011 - one thousand one ten one 2117 - two thousand one hundred one ten seven 3001 - three thousand one 13101 - one three thousand one hundred one 14001 - one four thousand one 16000 - one six thousand 25119 - two five thousand one hundred one ten nine 65009 - six five thousand nine 315119 - three one five thousand one hundred one ten nine 1000001 - one million one 1315119 - one million three one five thousand one hundred one ten nine 11315119 - one one million three one five thousand one hundred one ten nine 74315119 - seven four million three one five thousand one hundred one ten nine 174315119 - one seven four million three one five thousand one hundred one ten nine 1174315119 - one billion one seven four million three one five thousand one hundred one ten nine 15174315119 - one five billion one seven four million three one five thousand one hundred one ten nine 35174315119 - three five billion one seven four million three one five thousand one hundred one ten nine 935174315119 - nine three five billion one seven four million three one five thousand one hundred one ten nine 2935174315119 - two trillion nine three five billion one seven four million three one five thousand one hundred one ten nine
As you can see, all the named powers of ten are now treated in exactly the same way i.e.:
- 10 is ten to the power of 1
- 100 is ten to the power of 2
- etc.
So 101 is one hundred one and 11 is one ten one, 203 is two hundred three, while 23 is two ten three – logical isn’t it? There is no longer any need for any specially named numbers between 10 and 100 since we can represent all of them just as efficiently with this notation; this means no more special cases for numbers under 100. The other one we had was all the extra usages of the word hundred. We simply eliminate them all together (except for when we are talking about numbers between 100 and 1000), three one five thousand conveys the same amount of information as three hundred and fifteen thousand – the word hundred is totally redundant. Now, every named power of 10 appears only once in the English representation of the integer which allows us to still retain the ability to “estimate” the magnitude instantly without any extraneous information:
one million three one five thousand one hundred one ten nine
Short, consistent and to the point. It does seem a little awkward at first, but after a few minutes it starts to grow on you – after all, it is much more logical than the system we have now. There you go, deriving “better” English through some Ruby code. Not only do we get superior language but it is also easier, faster and less error prone to code it – FTW brothers, FTW :)!!!
Image by plindberg