Parsing Roman Numbers—Beautifully

Posted on 2009-03-26 by Curt Sampson :: comments enabled

In a recent TSAC meeting, Travis brought in rather interesting little problem: parsing Roman numerals. It’s more difficult than it looks, at least if you want to write beautiful code.

I’d looked at this a couple of times since the meeting, but today I finally sat down and built a reasonably nice solution to the problem (in Haskell, of course), which I’ll share here. I’ll present the code more or less in the order of my thinking about how to solve the problem, and at a few points I’ll introduce for the purposes of explanation and example code that isn’t in the real implementation. I’ll also entirely pass over the tests and the (trivial) code for running the parser on an input file and generating output. However, you can download the runnable source and play with it yourself if you like; I encourage you to do so.

So let’s dive in!

We not only have to convert valid Roman numerals to their Arabic form, but also indicate when a string of characters is not a valid Roman numeral. Thus, the first step is to figure out just what the rules are. Having had a couple of quick stabs at this before, and found that it turned out to be a fair amount of work for my tiny little brain, I wanted not only to write down the rules, but write them down in as clear a form as possible so I could read and understand them later. The beauty of Haskell, of course, is that I can both write down the rules in this very clear way, and have a valid Haskell program that I can execute.

The letters themselves were pretty easy; we just match up the characters and their values:

i = letter 'I' 1
v = letter 'V' 5
x = letter 'X' 10
l = letter 'L' 50
c = letter 'C' 100
d = letter 'D' 500
m = letter 'M' 1000

You’ll note that the letter function does not exist at this point, nor do I even have a type for it yet. This is the way I work: I first write the perfect, clear, explanatory top-level code that I want, and I worry about implementing the functions (such as letter) and determining the types of the functions and their result values, later. (I’m a very “top-down” programmer these days, which is quite the reverse of where I was ten years ago.)

Now, a Roman numeral can have up to three Ms at the beginning, and then after that the rules are similar for the groups of letters within the hundreds, tens and ones, or what I’ve decided to call “decades.” Each decade uses a combination of letters for the unit (such as I), five of that unit (V), and ten of that unit (X) to express what in Arabic numerals are the ones place, tens place, and hundreds place. (I would have said “column” rather than “place,” had I not grown up just as we were switching to New Math.)

The key insight here for me was that a decade uses not only the letters representing the unit and five times the unit, but also the letter representing ten times the unit, which is the unit from the next decade up. We never have a number that big within the decade, but the letter is used within the decade in combination to represent one less than the unit of the next decade (e.g., ”IX” for 9).

The top decade, the thousands, is slightly different in that it can’t combine with decades above it, because there are none. So we start with up to three Ms, signifying thousands, and then have decades for the hundreds (D and C), tens (L and X), and ones (V and I), which all work in the exact same way:

    =  addingUpTo 3 m
    >. decade m d c
    >. decade c l x
    >. decade x v i
    >. end

We start with up to three ’M’ characters that add a thousand each to the total for this number (remember m = letter 'M' 1000' above?) and then we’ve got decades of the smaller quantities. Here I’ve introduced >. as the sequencing operator, indicating the order in which the components appear. Note also that I’ve indicated that the string we’re parsing must end after the lowest decade; XIV is a valid Roman numeral, but XIVX is not!

But how does a decade work? It turns there are five alternatives. One, of course, is that it doesn’t exist at all: MLXI (1061) has no hundreds decade. Otherwise it’s a combination of the letters where, depending on their position, they either add to or subtract from the total. So, adding some operators to determine whether a letter is adding or subtracting, using my sequencing operator from above, and adding |. as the alternatives operator, I can list all the alternatives:

decade decem quintum unit
    =  subtracting unit >. adding decem
    |. subtracting unit >. adding quintum
    |. adding quintum   >. addingUpTo 3 unit
    |. addingUpTo 3 unit
    |. nothing

In less mathematical notation, this means we may have one of the following alternatives:

  • a unit (I) before a decem (ten times the unit, X), meaning that we subtract the unit from the decem (IX is 9)

  • a unit before a qunitum, meaning that we subtract the unit from the quintum (IV is 4)

  • a quintum (V) before up to three units (V, VI, VII or VIII)

  • up to three units alone (nothing, I, II or III)

  • or nothing at all, ””

And that’s pretty much it for the rules! All that’s left is to implement the various functions I’ve invented to express the rules.

I do this in a way similar to a standard monadic parser. (Why did I not implement this as a Monad? That’s an interesting topic for discussion.)

First, we define a state for the parser that holds what remains to parse and the result of our parse so far, if we’re currently parsing, or otherwise indicates a successful parse or failure. We also define a function that, given the string to parse, creates an initial state. I use a ByteString here instead of a String for efficiency1.

data State
    = Parsing ByteString Int
    | Success Int
    | Failure

input :: ByteString -> State
input s = Parsing s 0

Then we declare a type for a parser: a function, that, given a State, returns a new State indicative of the work it’s done. romanNumber above is of this Parser type, as are the various things we combine to make that parser. The sequencing and alternative operators take two Parsers and return a new Parser that combines them in sequence or chooses one. I also define the priority of the those operators to be the same as the “and” and “or” boolean operators, to which they are parallel.

newtype Parser = Parser { run :: State -> State }

romanNumber :: Parser
(>.), (|.) :: Parser -> Parser -> Parser
infixr 3 >.
infixr 2 |.

The run accessor above allows us to get the parsing function out of the Parser, so applying run p to a State will give us the new state produced by parser p.

Now we can create a function that, given a Parser as defined above, runs it and if it finishes successfully, returns the result, otherwise returns Nothing to indicate failure. We also add a convenience function to run the romanNumber parser on a ByteString.

parse :: Parser -> ByteString -> Maybe Int
parse parser s =
    case run parser $ input s of
         Parsing {} -> Nothing
         Failure    -> Nothing
         Success n  -> Just n

parseRoman :: ByteString -> Maybe Int
parseRoman s = parse romanNumber s

All that done, it’s time to implement some of the lower-level pieces that make romanNumber and decade work!

After a fair amount of mucking about, I realized that my concept of a “letter,” introduced in the first fragment of code at the very top of this post, was missing something essential, though it was very strongly hinted at in the way it was used in decade. Letters actually do different things depending on their position: in ”VI” the ”I” adds 1 to the number we’re building, but in ”IV” the ”I” subtracts 1.

So let’s introduce a type to explain this: we’ll have an operator, that being add or subtract, and the letter function will take that operator and return a parser that makes the appropriate state change the appropriate thing when it parses the letter:

type Op     = (Int -> Int -> Int)
type Letter = (Op) -> Parser
i, v, x, l, c, d, m :: Letter

That leads to appropriate type declarations for the adding and subtracting functions we used in decade:

adding, subtracting :: Letter -> Parser
addingUpTo          :: Int -> Letter -> Parser

The definitions of the first two are easy:

adding letter      = letter (+)
subtracting letter = letter (-)

addingUpTo is a little more complex, but fairly obvious when you think about it: it’s either nothing, or a letter followed by a recursive application of itself.

addingUpTo n letter = adding letter >. addingUpTo (n-1) letter
                      |. nothing

But what’s the base case of this recursion? Once we’ve read up to as many as we’re going to, we stop, and let the next parser in the sequence take over. We neither consume any letters nor change the number we’ve calculated so far, which means we don’t change the state at all. This is very simply when we have higher-order functions, pack the “give back what you got” function into a Parser:

addingUpTo 0 _      = Parser id

If that confuses you, it may help to think of the same thing expressed as a lambda expression returning exactly what it got:

addingUpTo 0 _      = Parser (\x -> x)

(Note that in the actual program, that base case equation comes before the other one; we need to match 0 as the first parameter before we try the other equation, which matches anything.)

Now that we’ve got that, how do we define letter itself? Well, it takes a Roman character and amount that it stands for, and produces a function that, given an add or subtract operation, produces a parser that will parse that letter and perform the operation.

letter :: Char -> Int -> Letter

Our actual definition looks a bit different from this, though; it starts:

letter roman arabic op =

What’s up with this? We had two arguments in the type definition, but have three in the equation! But remember, a Letter is actually an (Op) -> Parser, and an Op is in turn an (Int -> Int -> Int). So doing standard algebraic substitution here, just like we did in high school (we’re now beyond New Math here), we have:

letter :: Char -> Int -> Letter
letter :: Char -> Int -> (Op) -> Parser
letter :: Char -> Int -> (Int -> Int -> Int) -> Parser

So our third parameter, op, is a function that takes two Ints and returns an Int.

But just in case I’ve not lost you yet, it gets more complex. Remember, a Parser is actually a wrapped function as well:

newtype Parser = Parser { run :: State -> State }

so we need to return a function wrapped by handing it to the Parser value constructor. This is typically done with a lambda expression. Recall that a State is

data State
    = Parsing ByteString Int
    | Success Int
    | Failure

So, if we had a parser that merely added one to the Int within our state when we’re in a Parsing state, we could write it as

addOne :: Parser
addOne = Parser (\Parsing s n -> Parsing s (n+1))

or, alternatively expressed using $ instead of parentheses:

addOne = Parser $ \Parsing s n -> Parsing s (n+1)

What we’re doing here is creating a function that takes a Parsing state, binds the variables s and n to the ByteString and Int within it, and returns a new Parsing state with the same ByteString, but with an Int incremented by one. We then wrap this function back up into a Parser by passing it to the Parser value constructor. (This parser doesn’t deal with the Success Int and Failure values of State, as befits a simple example designed to break in the real world.)

So now we can look at our full definition of letter:

letter roman arabic op =
    Parser $ \state ->
        case state of
             Parsing s n ->
                 case () of
                   _ | s == empty      -> Failure
                     | head s == roman -> Parsing (tail s) (n `op` arabic)
                     | otherwise       -> Failure
             _ -> Failure

Here we return a parser that, given a Parsing state with a non-empty ByteString whose first character is roman, returns a new Parsing state with the first character of the ByteString consumed (removed), and the parsed value so far having arabic added to or subtracted from it, depending on whether op was + or -. In any other case, it returns Failure.

It may be helpful to walk carefully through the expansion of adding c. We have

c = letter 'C' 100
adding letter = letter (+)

giving us

adding c = letter 'C' 100 (+)

and then

letter 'C' 100 (+) =
    Parser $ \state ->
        case state of
             Parsing s n ->
                 case () of
                   _ | s == empty      -> Failure
                     | head s == 'C'   -> Parsing (tail s) (n + 100)
                     | otherwise       -> Failure
             _ -> Failure

If this still isn’t quite clear, it may be helpful to read some of the other examples further on in this article and then come back to this.

But now, we’re almost done. We still need to give definitions of the sequence (“and”) and alternative (“or”) operators.

The sequence operator runs the first parser and, if it succeeds (i.e., returns an updated state that indicates we are continuing parsing), runs the second parser on the state produced by the first:

(>.) :: Parser -> Parser -> Parser
a >. b = Parser $ \state ->
             case run a state of
                  Parsing s n -> run b (Parsing s n)
                  _           -> Failure

The alternative operator does the exact opposite: it returns the state that the first parser produced if it succeeded, but if the first parser fails, instead runs the second parser on the original state:

(|.) :: Parser -> Parser -> Parser
a |. b = Parser $ \state ->
             case run a state of
                  Failure -> run b state
                  state'  -> state'

That taken care of, we have only nothing and end left. nothing, as with the base case of addingUpTo, says “everything is ok, carry on” by consuming nothing from the ByteString and not changing the currently calculated value, i.e., returning the state unchanged:

nothing :: Parser
nothing = Parser id

end, on the other hand, is a bit more special. Recall that it’s used after we’ve finished parsing all of our decades, and we should have nothing more, as so:

    =  addingUpTo 3 m
    >. decade m d c
    >. decade c l x
    >. decade x v i
    >. end

This is the point at which we either declare success, if we’ve parsed everything fine so far and there’s nothing left in the string, or failure if there was a problem earlier on or we think we should be done but there’s still stuff remaining parse. (Recall above, I mentioned that ”XIVX” is not a valid Roman numeral; the parser would be successful through ”XIV” and then would fail on the final unexpected ”X”. It’s instructive to work though precisely what decade x v i would do with the above input.)

But there’s one one more little hitch: addingUpTo and decade both match no input at all. It’s of course possible to have a valid Roman numeral with no Ms at the beginning, and also with missing decades, as per my example above. This means that, given an empty string, addingUpTo and the following three decades will happily match it. However, the empty string is not a Roman numeral!

Here, we can take advantage of the fact that there’s no zero in Roman numerals, and that happens to be our initial value of the result (or really, “result so far”) in the state. If, at the end, everything matches but our result is still zero, we’ve been given an empty string instead of a valid Roman numeral. In that case, we also need to fail. So our definition is:

end :: Parser
end     = Parser $ \state ->
            case state of Parsing "" 0 -> Failure
                          Parsing "" n -> Success n
                          _            -> Failure

And now, appropriately enough, on that function we’ve come to the end of this post, which is more than three times as long as the program itself, including various I/O logic and unit tests that I’ve not mentioned here. That’s a testament to the concision of Haskell. But more importantly, I think that the highest levels of this program, the first three examples of code introduced in this post, express, much more clearly than usual in computer programming, the specification of what we’re doing. That’s beautiful code.

  1. The actual speed of this program compared to (incomprehensible) C versions is discussed a little bit here. Summary: it’s reasonably efficient.


By Sandro Pasquali on 2009-03-27 15:50:53-0400:

This problem was the starting point for an informal contest we had internally. Here are some of the more interesting ones:

def roman_to_number(roman):
return reduce(lambda acc, x: acc+x - 2*(acc%x),
	for n in roman], 0)
print roman_to_number('MCMXIV')

var s=prompt("Enter a roman numeral:","MMIX").toUpperCase();
var c={73:1,86:5,88:10,76:50,67:100,68:500,77:1000};
var lv=0,f=0;
for(var x=s.length-1;x>=0;x--){
alert(s+" = "+f);

curl -s -A ""
"http://www.google.com/search?hl=en&q=$1+to+decimal&btnG=Search"|grep -o
"[IVXLCDM] = [0-9]"
By Omar AntolĂ­n Camarena on 2009-03-28 14:43:15-0400:

Your high level code is definitely beautiful but I find your lower level supporting code to be too complicated.

For such a simple parsing problem I’d just use an integer accumulator to calculate the value of the roman number. You can use parsers with the simple type (Int, String) -> Maybe (Int, String); each parser gets the current value of the accumulator, the rest on the input and returns Just the new value of the accumulator and the remainder of the input or Nothing if there is an error.

I don’t like writing plumbing code when I can use library functions instead. For this representation of parsers, >. is just Kleisli composition >=> (from Control.Monad) and |. is just pointwise mplus:

(f |. g) x = f x `mplus` g x

Letters are very simple

letter roman arabic (accum, input)
    | null input or roman/=head input = Nothing
    | otherwise = Just (accum + arabic, tail input)

To make a letter subtract instead of add just negate the accumulator, run the letter and negate again:

sub letter (accum, input) =
    letter (-accum, input) >>= (\(acc, rest) -> Just (-acc, rest))

So, decade would look like this (I also factored it more because I don’t like repeating code):

decade decem quintum unit =
    (sub unit >=> (quintum |. decem))
    |.  ((quintum |. nothing) >=> addingUpTo 3 unit)

(Note that in this repr. of parsers, nothing = Just, which is mildly funny.)

By Curt Sampson on 2009-05-07 11:46:30:

Sandro: The trick, of course, is that you must also indicate when a string is not a valid roman numeral. This makes the problem considerably more difficult, I found.

Omar: I entirely agree with you that the lower-level stuff is too complicated, and that’s in large part due to my lack of familiarity with the standard libraries.

Add a comment »
comments are moderated