## Parsing Roman Numbers—Beautifully

Posted on 2009-03-26 by Curt Sampson :: comments enabled

In a recent TSAC meeting, Travis brought in rather interesting little problem: parsing Roman numerals. It’s more difficult than it looks, at least if you want to write beautiful code.

I’d looked at this a couple of times since the meeting, but today I finally sat down and built a reasonably nice solution to the problem (in Haskell, of course), which I’ll share here. I’ll present the code more or less in the order of my thinking about how to solve the problem, and at a few points I’ll introduce for the purposes of explanation and example code that isn’t in the real implementation. I’ll also entirely pass over the tests and the (trivial) code for running the parser on an input file and generating output. However, you can download the runnable source and play with it yourself if you like; I encourage you to do so.

So let’s dive in!

We not only have to convert valid Roman numerals to their Arabic form, but also indicate when a string of characters is not a valid Roman numeral. Thus, the first step is to figure out just what the rules are. Having had a couple of quick stabs at this before, and found that it turned out to be a fair amount of work for my tiny little brain, I wanted not only to write down the rules, but write them down in as clear a form as possible so I could read and understand them later. The beauty of Haskell, of course, is that I can both write down the rules in this very clear way, and have a valid Haskell program that I can execute.

The letters themselves were pretty easy; we just match up the characters and their values:

``````i = letter 'I' 1
v = letter 'V' 5
x = letter 'X' 10
l = letter 'L' 50
c = letter 'C' 100
d = letter 'D' 500
m = letter 'M' 1000``````

You’ll note that the `letter` function does not exist at this point, nor do I even have a type for it yet. This is the way I work: I first write the perfect, clear, explanatory top-level code that I want, and I worry about implementing the functions (such as `letter`) and determining the types of the functions and their result values, later. (I’m a very “top-down” programmer these days, which is quite the reverse of where I was ten years ago.)

Now, a Roman numeral can have up to three `M`s at the beginning, and then after that the rules are similar for the groups of letters within the hundreds, tens and ones, or what I’ve decided to call “decades.” Each decade uses a combination of letters for the unit (such as `I`), five of that unit (`V`), and ten of that unit (`X`) to express what in Arabic numerals are the ones place, tens place, and hundreds place. (I would have said “column” rather than “place,” had I not grown up just as we were switching to New Math.)

The key insight here for me was that a decade uses not only the letters representing the unit and five times the unit, but also the letter representing ten times the unit, which is the unit from the next decade up. We never have a number that big within the decade, but the letter is used within the decade in combination to represent one less than the unit of the next decade (e.g., ”`IX`” for 9).

The top decade, the thousands, is slightly different in that it can’t combine with decades above it, because there are none. So we start with up to three `M`s, signifying thousands, and then have decades for the hundreds (`D` and `C`), tens (`L` and `X`), and ones (`V` and `I`), which all work in the exact same way:

``````romanNumber
>. end``````

We start with up to three ’`M`’ characters that add a thousand each to the total for this number (remember `m = letter 'M' 1000'` above?) and then we’ve got decades of the smaller quantities. Here I’ve introduced `>.` as the sequencing operator, indicating the order in which the components appear. Note also that I’ve indicated that the string we’re parsing must end after the lowest decade; `XIV` is a valid Roman numeral, but `XIVX` is not!

But how does a decade work? It turns there are five alternatives. One, of course, is that it doesn’t exist at all: `MLXI` (1061) has no hundreds decade. Otherwise it’s a combination of the letters where, depending on their position, they either add to or subtract from the total. So, adding some operators to determine whether a letter is adding or subtracting, using my sequencing operator from above, and adding `|.` as the alternatives operator, I can list all the alternatives:

``````decade decem quintum unit
=  subtracting unit >. adding decem
|. subtracting unit >. adding quintum
|. nothing``````

In less mathematical notation, this means we may have one of the following alternatives:

• a unit (`I`) before a decem (ten times the unit, `X`), meaning that we subtract the unit from the decem (`IX` is 9)

• a unit before a qunitum, meaning that we subtract the unit from the quintum (`IV` is 4)

• a quintum (`V`) before up to three units (`V`, `VI`, `VII` or `VIII`)

• up to three units alone (nothing, `I`, `II` or `III`)

• or nothing at all, ””

And that’s pretty much it for the rules! All that’s left is to implement the various functions I’ve invented to express the rules.

I do this in a way similar to a standard monadic parser. (Why did I not implement this as a `Monad`? That’s an interesting topic for discussion.)

First, we define a state for the parser that holds what remains to parse and the result of our parse so far, if we’re currently parsing, or otherwise indicates a successful parse or failure. We also define a function that, given the string to parse, creates an initial state. I use a `ByteString` here instead of a `String` for efficiency1.

``````data State
= Parsing ByteString Int
| Success Int
| Failure

input :: ByteString -> State
input s = Parsing s 0``````

Then we declare a type for a parser: a function, that, given a `State`, returns a new `State` indicative of the work it’s done. `romanNumber` above is of this `Parser` type, as are the various things we combine to make that parser. The sequencing and alternative operators take two `Parser`s and return a new `Parser` that combines them in sequence or chooses one. I also define the priority of the those operators to be the same as the “and” and “or” boolean operators, to which they are parallel.

``````newtype Parser = Parser { run :: State -> State }

romanNumber :: Parser
(>.), (|.) :: Parser -> Parser -> Parser
infixr 3 >.
infixr 2 |.``````

The `run` accessor above allows us to get the parsing function out of the `Parser`, so applying `run p` to a `State` will give us the new state produced by parser `p`.

Now we can create a function that, given a `Parser` as defined above, runs it and if it finishes successfully, returns the result, otherwise returns `Nothing` to indicate failure. We also add a convenience function to run the `romanNumber` parser on a ByteString.

``````parse :: Parser -> ByteString -> Maybe Int
parse parser s =
case run parser \$ input s of
Parsing {} -> Nothing
Failure    -> Nothing
Success n  -> Just n

parseRoman :: ByteString -> Maybe Int
parseRoman s = parse romanNumber s``````

All that done, it’s time to implement some of the lower-level pieces that make `romanNumber` and `decade` work!

After a fair amount of mucking about, I realized that my concept of a “letter,” introduced in the first fragment of code at the very top of this post, was missing something essential, though it was very strongly hinted at in the way it was used in `decade`. Letters actually do different things depending on their position: in ”`VI`” the ”`I`” adds 1 to the number we’re building, but in ”`IV`” the ”`I`” subtracts 1.

So let’s introduce a type to explain this: we’ll have an operator, that being add or subtract, and the letter function will take that operator and return a parser that makes the appropriate state change the appropriate thing when it parses the letter:

``````type Op     = (Int -> Int -> Int)
type Letter = (Op) -> Parser
i, v, x, l, c, d, m :: Letter``````

That leads to appropriate type declarations for the adding and subtracting functions we used in `decade`:

``````adding, subtracting :: Letter -> Parser
addingUpTo          :: Int -> Letter -> Parser``````

The definitions of the first two are easy:

``````adding letter      = letter (+)
subtracting letter = letter (-)``````

`addingUpTo` is a little more complex, but fairly obvious when you think about it: it’s either nothing, or a letter followed by a recursive application of itself.

``````addingUpTo n letter = adding letter >. addingUpTo (n-1) letter
|. nothing``````

But what’s the base case of this recursion? Once we’ve read up to as many as we’re going to, we stop, and let the next parser in the sequence take over. We neither consume any letters nor change the number we’ve calculated so far, which means we don’t change the state at all. This is very simply when we have higher-order functions, pack the “give back what you got” function into a `Parser`:

``addingUpTo 0 _      = Parser id``

If that confuses you, it may help to think of the same thing expressed as a lambda expression returning exactly what it got:

``addingUpTo 0 _      = Parser (\x -> x)``

(Note that in the actual program, that base case equation comes before the other one; we need to match `0` as the first parameter before we try the other equation, which matches anything.)

Now that we’ve got that, how do we define `letter` itself? Well, it takes a Roman character and amount that it stands for, and produces a function that, given an add or subtract operation, produces a parser that will parse that letter and perform the operation.

``letter :: Char -> Int -> Letter``

Our actual definition looks a bit different from this, though; it starts:

``letter roman arabic op =``

What’s up with this? We had two arguments in the type definition, but have three in the equation! But remember, a `Letter` is actually an `(Op) -> Parser`, and an `Op` is in turn an `(Int -> Int -> Int)`. So doing standard algebraic substitution here, just like we did in high school (we’re now beyond New Math here), we have:

``````letter :: Char -> Int -> Letter
letter :: Char -> Int -> (Op) -> Parser
letter :: Char -> Int -> (Int -> Int -> Int) -> Parser``````

So our third parameter, `op`, is a function that takes two `Int`s and returns an `Int`.

But just in case I’ve not lost you yet, it gets more complex. Remember, a `Parser` is actually a wrapped function as well:

``newtype Parser = Parser { run :: State -> State }``

so we need to return a function wrapped by handing it to the `Parser` value constructor. This is typically done with a lambda expression. Recall that a `State` is

``````data State
= Parsing ByteString Int
| Success Int
| Failure``````

So, if we had a parser that merely added one to the `Int` within our state when we’re in a `Parsing` state, we could write it as

``````addOne :: Parser
addOne = Parser (\Parsing s n -> Parsing s (n+1))``````

or, alternatively expressed using `\$` instead of parentheses:

``addOne = Parser \$ \Parsing s n -> Parsing s (n+1)``

What we’re doing here is creating a function that takes a `Parsing` state, binds the variables `s` and `n` to the ByteString and Int within it, and returns a new `Parsing` state with the same ByteString, but with an Int incremented by one. We then wrap this function back up into a Parser by passing it to the `Parser` value constructor. (This parser doesn’t deal with the `Success Int` and `Failure` values of `State`, as befits a simple example designed to break in the real world.)

So now we can look at our full definition of `letter`:

``````letter roman arabic op =
Parser \$ \state ->
case state of
Parsing s n ->
case () of
_ | s == empty      -> Failure
| head s == roman -> Parsing (tail s) (n `op` arabic)
| otherwise       -> Failure
_ -> Failure``````

Here we return a parser that, given a `Parsing` state with a non-empty ByteString whose first character is `roman`, returns a new `Parsing` state with the first character of the ByteString consumed (removed), and the parsed value so far having `arabic` added to or subtracted from it, depending on whether `op` was `+` or `-`. In any other case, it returns `Failure`.

It may be helpful to walk carefully through the expansion of `adding c`. We have

``````c = letter 'C' 100

giving us

``adding c = letter 'C' 100 (+)``

and then

``````letter 'C' 100 (+) =
Parser \$ \state ->
case state of
Parsing s n ->
case () of
_ | s == empty      -> Failure
| head s == 'C'   -> Parsing (tail s) (n + 100)
| otherwise       -> Failure
_ -> Failure``````

If this still isn’t quite clear, it may be helpful to read some of the other examples further on in this article and then come back to this.

But now, we’re almost done. We still need to give definitions of the sequence (“and”) and alternative (“or”) operators.

The sequence operator runs the first parser and, if it succeeds (i.e., returns an updated state that indicates we are continuing parsing), runs the second parser on the state produced by the first:

``````(>.) :: Parser -> Parser -> Parser
a >. b = Parser \$ \state ->
case run a state of
Parsing s n -> run b (Parsing s n)
_           -> Failure``````

The alternative operator does the exact opposite: it returns the state that the first parser produced if it succeeded, but if the first parser fails, instead runs the second parser on the original state:

``````(|.) :: Parser -> Parser -> Parser
a |. b = Parser \$ \state ->
case run a state of
Failure -> run b state
state'  -> state'``````

That taken care of, we have only `nothing` and `end` left. `nothing`, as with the base case of `addingUpTo`, says “everything is ok, carry on” by consuming nothing from the ByteString and not changing the currently calculated value, i.e., returning the state unchanged:

``````nothing :: Parser
nothing = Parser id``````

`end`, on the other hand, is a bit more special. Recall that it’s used after we’ve finished parsing all of our decades, and we should have nothing more, as so:

``````romanNumber
>. end``````

This is the point at which we either declare success, if we’ve parsed everything fine so far and there’s nothing left in the string, or failure if there was a problem earlier on or we think we should be done but there’s still stuff remaining parse. (Recall above, I mentioned that ”`XIVX`” is not a valid Roman numeral; the parser would be successful through ”`XIV`” and then would fail on the final unexpected ”`X`”. It’s instructive to work though precisely what `decade x v i` would do with the above input.)

But there’s one one more little hitch: `addingUpTo` and `decade` both match no input at all. It’s of course possible to have a valid Roman numeral with no `M`s at the beginning, and also with missing decades, as per my example above. This means that, given an empty string, `addingUpTo` and the following three `decade`s will happily match it. However, the empty string is not a Roman numeral!

Here, we can take advantage of the fact that there’s no zero in Roman numerals, and that happens to be our initial value of the result (or really, “result so far”) in the state. If, at the end, everything matches but our result is still zero, we’ve been given an empty string instead of a valid Roman numeral. In that case, we also need to fail. So our definition is:

``````end :: Parser
end     = Parser \$ \state ->
case state of Parsing "" 0 -> Failure
Parsing "" n -> Success n
_            -> Failure``````

And now, appropriately enough, on that function we’ve come to the end of this post, which is more than three times as long as the program itself, including various I/O logic and unit tests that I’ve not mentioned here. That’s a testament to the concision of Haskell. But more importantly, I think that the highest levels of this program, the first three examples of code introduced in this post, express, much more clearly than usual in computer programming, the specification of what we’re doing. That’s beautiful code.

1. The actual speed of this program compared to (incomprehensible) C versions is discussed a little bit here. Summary: it’s reasonably efficient.

By Sandro Pasquali on 2009-03-27 15:50:53-0400:

This problem was the starting point for an informal contest we had internally. Here are some of the more interesting ones:

``````def roman_to_number(roman):
return reduce(lambda acc, x: acc+x - 2*(acc%x),
[{'I':1,'V':5,'X':10,'L':50,'C':100,'D':500,'M':1000}[n]
for n in roman], 0)
end
print roman_to_number('MCMXIV')``````

``````var s=prompt("Enter a roman numeral:","MMIX").toUpperCase();
var c={73:1,86:5,88:10,76:50,67:100,68:500,77:1000};
var lv=0,f=0;
for(var x=s.length-1;x>=0;x--){
cv=c[s.charCodeAt(x)];
(cv>=lv)?f+=cv:f-=cv;lv=cv;
}

``````#!/bin/bash
curl -s -A ""
"[IVXLCDM] = [0-9]"``````
By Omar Antolín Camarena on 2009-03-28 14:43:15-0400:

Your high level code is definitely beautiful but I find your lower level supporting code to be too complicated.

For such a simple parsing problem I’d just use an integer accumulator to calculate the value of the roman number. You can use parsers with the simple type `(Int, String) -> Maybe (Int, String)`; each parser gets the current value of the accumulator, the rest on the input and returns Just the new value of the accumulator and the remainder of the input or Nothing if there is an error.

I don’t like writing plumbing code when I can use library functions instead. For this representation of parsers, `>.` is just Kleisli composition `>=>` (from `Control.Monad`) and `|.` is just pointwise `mplus`:

``(f |. g) x = f x `mplus` g x``

Letters are very simple

``````letter roman arabic (accum, input)
| null input or roman/=head input = Nothing
| otherwise = Just (accum + arabic, tail input)``````

To make a letter subtract instead of add just negate the accumulator, run the letter and negate again:

``````sub letter (accum, input) =
letter (-accum, input) >>= (\(acc, rest) -> Just (-acc, rest))``````

So, decade would look like this (I also factored it more because I don’t like repeating code):

``````decade decem quintum unit =
(sub unit >=> (quintum |. decem))
|.  ((quintum |. nothing) >=> addingUpTo 3 unit)``````

(Note that in this repr. of parsers, `nothing = Just`, which is mildly funny.)

By Curt Sampson on 2009-05-07 11:46:30:

Sandro: The trick, of course, is that you must also indicate when a string is not a valid roman numeral. This makes the problem considerably more difficult, I found.

Omar: I entirely agree with you that the lower-level stuff is too complicated, and that’s in large part due to my lack of familiarity with the standard libraries.