You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

25 KiB

title description author date published tags description layout comments code_repo
Parser Combinators Combining functions to build a parser Logan McGrath 2022-06-20T10:37:30-0700 2022-09-05T12:55:18-0700 functional programming, programming, scala, design patterns, combinators, parsing In my post I introduced function combinators as a pattern for building complex logic. I added the caveat that combinators should not be used to define sets, which I feel undersells the value of combinators as a general programming pattern. In this post we will explore combinators as an applied solution to a common problem: parsing text. post true https://bitsof.thisfieldwas.green/keywordsalad/parser-combinators/src/branch/post/parser-combinators

In my post {{linkedTitle "_posts/2022-06-16-set-function.md"}} I introduced function combinators as a pattern for building complex logic. I added the caveat that combinators should not be used to define sets, which I feel undersells the value of combinators as a general programming pattern. In this post we will explore combinators as an applied solution to a common problem: parsing text.

The code that accompanies this post may be found here.

The parse() function

Parsers begin as a simple signature, a function called parse() from an input Int offset and String to either a failed offset or matched offset.

:::{.numberLines}

trait Parse extends (((Int, String)) => Either[Int, (Int, String)])

:::

This signature enables two capabilities:

  1. On failure the offset where the parse() function failed to recognize the input is returned.
  2. On success the offset where the end of the recognized input is returned along with the input, which can be used to extract the matched input and also be passed to the next parse() function as its starting point.

Below we will introduce specialized parse() functions within the Parse companion object.

Implementing the first parse() function

The first parse() function we can implement is satisfy(). This function works by using a predicate to match the Char at the current offset within the input.

:::{.numberLines}

def satisfy(predicate: Char => Boolean): Parse = {
  case (offset, input) if predicate(input(offset)) => Right(offset + 1, input)
  case (offset, _)                                 => Left(offset)
}

:::

This function is very simple; it can match single Chars easily.

:::{.numberLines}

"satisfy" can {
  "recognize a single character if it matches" in {
    val parse = satisfy(_ == 'a')
    parse((0, "abc")) shouldBe Right((1, "abc"))
  }
  "reject a character that doesn't match" in {
    val parse = satisfy(_ == 'a')
    parse((0, "xyz")) shouldBe Left(0)
  }
}

:::

With a little more work, it can be made to match Unicode Chars!

:::{.numberLines}

"recognize Unicode characters" in {
  val parse = {
    val pattern = Pattern.compile("\\\\p{Ll}")
    satisfy(c => pattern.matcher(CharBuffer.wrap(ArrayCharSequence(Array(c)))).matches())
  }
  parse((0, "π is pi")) shouldBe Right((1, "π is pi"))
  parse((0, "Π is uppercase pi")) shouldBe Left(0)
}

:::

Thus satisfy() is sufficiently versatile to meet a wide range of needs, though it requires that you be creative in how you create your predicates.

Specializing the satisfy() parse function

Based on the previous example, recognizing a character with a regular expression is really cumbersome. Let's define a parse() function that makes this kind of recognizer easier to define.

:::{.numberLines}

def satisfyRegex(pattern: String): Parse = {
  val pattern = Pattern.compile("\\\\p{Ll}")
  satisfy(c => pattern.matcher(CharBuffer.wrap(ArrayCharSequence(Array(c)))).matches())
}

:::

Now a regular expression can be provided without creating the boilerplate to support the predicate directly. The test to show that the satisfyRegex() function is identical to the original, as well.

:::{.numberLines}

"satisfyRegex" can {
  "recognize a character using a regex" in {
    val parse = satisfyRegex("\\\\p{Ll}")
    parse((0, "π is lowercase pi")) shouldBe Right((1, "π is lowercase pi"))
    parse((0, "Π is uppercase pi")) shouldBe Left(0)
  }
}

:::

Parser combinators

So far we have two parse() functions: satisfy() and satisfyRegex(). Neither of these do more than recognize single Chars and advance the offset of the input. We want to be able to recognize sections of input.

Let's start first by recognizing two letters, one after the other, with our first combinator: the & combinator.

What is a combinator? A combinator is a function that builds a function from other functions. In this case, we will be building a new parse() function that combines the capabilities of two other parse() functions.

Here is how we define the combinator on the Parse trait:

:::{.numberLines}

def &(second: Parse): Parse =
  input => this(input).flatMap(nextInput => second(nextInput))

:::

By using the & combinator we can define new parse() functions from two existing ones, allowing us to recognize a length of input. Here is a short example using a test:

:::{.numberLines}

"&" which {
  val parseFirst = satisfy(_ == 'a')
  val parseSecond = satisfy(_ == 'b')
  val parse = parseFirst & parseSecond
  "recognizes the first letter and then the second" in {
    parse((0, "abc")) shouldBe Right((2, "abc"))
  }
  "recognizes the first letter but rejects the second" in {
    parse((0, "acc")) shouldBe Left(1)
  }
  "rejects the first letter" in {
    parse((0, "ccc")) shouldBe Left(0)
  }
}

:::

Recognizing alternatives with a second parse() function

Given the possibility that a parse() function may fail to recognize the input, is there a way that we could fall back to an alternative parse() function that would be able to recognize it? There is!

The next combinator is |. This combinator tries to recognize the input with the first parse() function and will use the second parse() function if it fails.

Here is how we define the combinator on the Parse trait:

:::{.numberLines}

def |(second: Parse): Parse =
  input => this(input).left.flatMap(_ => second(input))

:::

This combinator allows us to choose between possibilities of input. A simple example is demonstrated in its test:

:::{.numberLines}

"|" which {
  val parseFirst = satisfy(_ == 'a')
  val parseSecond = satisfy(_ == 'b')
  val parse = parseFirst | parseSecond
  "recognizes using the first parse function" in {
    parse((0, "abc")) shouldBe Right((1, "abc"))
  }
  "recognizes using the second parse function" in {
    parse((0, "bcd")) shouldBe Right((1, "bcd"))
  }
  "rejects if neither parser recognizes" in {
    parse((0, "cde")) shouldBe Left(0)
  }
}

:::

Putting together what we have so far

Now that we have sequential and alternative recognizers, we can start defining more complex parse() functions. Let's start with a term recognizer.

:::{.numberLines}

def term(value: String): Parse =
  value.tail.foldLeft(satisfy(_ == value.head))((p, c) => p & satisfy(_ == c))

:::

By using the classic functional programming idiom foldLeft(), we can build a term recognizer using only the functions and combinators we have defined so far. Slick!

Here is the test to show that our term() function works, including one using the | combinator with a second term:

:::{.numberLines}

"term" which {
  val parse = term("banana")
  "recognizes a word" in {
    parse((0, "banana apple")) shouldBe Right((6, "banana apple"))
  }
  "rejects input that doesn't match" in {
    parse((0, "banapple")) shouldBe Left(4)
  }
  "alternate with a second term" in {
    val combined = parse | term("apple")
    combined((0, "banana orange")) shouldBe Right((6, "banana orange"))
    combined((0, "apple orange")) shouldBe Right((5, "apple orange"))
  }
}

:::

How do we get the matched input?

There's a small problem with our parse() function. It only gives the end offset of the recognized input and there isn't any way to collect or modify the recognized input. This means we have to change the signature of the parse() function.

Modeling input

First, we need to consider what our input needs to model. Our input is characterized by two things, the text being parsed and the offset that the current parse() function is operating from. This defines a cursor:

:::{.numberLines}

case class InputCursor(offset: Int, input: String) {

  def peek: Char = input(offset)

  def +(amount: Int): InputCursor = copy(offset = offset + amount)

  def -(amount: Int): InputCursor = copy(offset = offset - amount)
}

:::

A cursor allows us to hold a reference to the input itself, unmodified, but increment or decrement the position we are at within the input. As our parse() functions recognize input, they increment the offset so that the following functions have a place to start.

Modeling output

The output of the parse() function changes as well. We want to be able to return the recognized input. This changes the signature of the parse() function:

:::{.numberLines}

trait Parse extends (InputCursor => Either[InputCursor, (???, InputCursor)])

:::

But what should be returned isn't entirely clear. If the recognizer matches a Char, then the parse() function should return a Char. But what about the term() recognizer? That should return a String. What if we're using a function that recognizes integers?

This means that the parse() function is generic in its return type:

:::{.numberLines}

trait Parse[A] extends (InputCursor => Either[InputCursor, (A, InputCursor)])

:::

The overall return type is also becoming verbose by using the general-purpose Either class. So let's define something specialized:

:::{.numberLines}

sealed trait ParseResult[A]

case class ParseFailure[A](cursor: InputCursor) extends ParseResult[A]

case class ParseSuccess[A](result: A, cursor: InputCursor) extends ParseResult

:::

But before we use it this means that all of our function signatures have to change!

The new shape of the parse() function

Our parse() function will soon become generic its return type, which makes it covariant in that parse() can be adapted to produce anything we want. What the parse() function produces is contextualized by whether it is recognized in the input, and as such there is a chance that nothing is produced because the input wasn't recognized. This means the the parse() function forms a specific structure: a context with the shape of a functor.

I have written about contexts, functors, and related structures extensively. Start from the following article if these terms are unfamiliar to you: {{linkedTitle "_posts/2022-03-15-contexts-and-effects.md"}}

Take a look at what the parse() function signature will look like after we start using ParseResult:

:::{.numberLines}

trait Parse[A] extends (InputCursor => ParseResult[A])

:::

Let's compare that with the shape of a functor: Parse[A] and F[A].

In fact, ParseResult[A] has the same shape!

To save some time, I've added the cats library to leverage the functor typeclass abstraction. First, let's implement the Functor typeclass for ParseResult[A]:

:::{.numberLines}

object ParseResult {

  implicit val parseResultFunctor: Functor[ParseResult] = new Functor[ParseResult] {

    override def map[A, B](fa: ParseResult[A])(f: A => B): ParseResult[B] =
      fa match {
        case ParseFailure(failedCursor)        => ParseFailure(failedCursor)
        case ParseSuccess(matched, nextCursor) => ParseSuccess(f(matched), nextCursor)
      }
  }
}

:::

This allows us to transform the contents of the ParseResult at-will.

Changing the return type of the parse() function itself

The parse() function itself is a functor. But that doesn't quite make sense: it doesn't contain anything because it's a function!

However, it still has the shape of a functor. Recall the new signature that it will have InputCursor => ParseResult[A]. It's covariant in A, giving it the shape of F[A]. So how do we treat it as a functor? Let's change the signature first and see how the code breaks.

And all of the parse() functions break! So we will step through each and get them working again, and we will leverage the ParseResult functor.

  1. satisfy() only requires that we change the return:

    def satisfy(predicate: Char => Boolean): Parse[Char] = cursor =>
      if (predicate(cursor.peek)) {
        ParseSuccess(cursor.peek, cursor + 1)
      } else {
        ParseFailure(cursor)
      }
    
  2. The & combinator has a flatMap() screaming to be abstracted:

    def &[B](second: Parse[B]): Parse[(A, B)] =
      cursor =>
        this(cursor) match {
          case ParseFailure(failedCursor) => ParseFailure(failedCursor)
          case ParseSuccess(firstMatched, nextCursor) =>
            second(nextCursor).map(secondMatched => (firstMatched, secondMatched))
        }
    
  3. The | combinator requires an abstraction to act on failure:

    def |(second: Parse[A]): Parse[A] =
      cursor =>
        this(cursor) match {
          case _: ParseFailure    => second(cursor)
          case s: ParseSuccess[_] => s
        }
    
  4. term() is affected the most, and the rewrite requires threading a cursor through a lot of parse() functions and handling the results. It even requires adding a specialized parse() function to always return a Unit and succeed!

    def unit: Parse[Unit] = cursor => ParseSuccess((), cursor)
    
    def term(value: String): Parse[String] =
      outerCursor =>
        value
          .map(c => satisfy(_ == c))
          .foldLeft(unit)((up, cp) => innerCursor => (up & cp)(innerCursor).map(_ => ()))(outerCursor)
          .map(_ => value)
    

The parse() function as a functor

First, let's define a Functor instance for Parse:

:::{.numberLines}


implicit val parseInstances: ParseInstances = new ParseInstances()

class ParseInstances extends Functor[Parse] {

  override def map[A, B](fa: Parse[A])(f: A => B): Parse[B] =
    cursor => fa(cursor).map(f)
}

:::

This leverages the ability to use ParseResult as a functor. The new map() function for parse() allows us to change what the function produces by wrapping the supplied function f: A => B in a new parse() function! We are effectively able to change the "contents" of the function this way.

Using parse() as a functor, we can refactor the term() function:

:::{.numberLines}

def term(value: String): Parse[String] =
  value
    .map(c => satisfy(_ == c))
    .foldLeft(unit)((up, cp) => (up & cp).map(_ => ()))
    .map(_ => value)

:::

Notice that we no longer need to manually pass around the input cursor. We've simply folded a sequence of parse() functions to Char down to a single parse() function to String. The functor abstraction allowed us to declaratively create this function without having to worry about the plumbing; we don't even have to declare the cursor as an explicit input.

Sequential parsing with the parse() function

We can't apply the functor abstraction to the & combinator as it sequences one parse() function after a successful application of a preceding parse() function. This means that it executes imperatively. The specific abstraction we have to use for this kind of execution is a monad.

A monad is a specialization of a functor. We can replace our functor instance of the parse() function with one for monad:

:::{.numberLines}

class ParseInstances extends Monad[Parse] {

  override def flatMap[A, B](fa: Parse[A])(f: A => Parse[B]): Parse[B] =
    cursor =>
      fa(cursor) match {
        case ParseFailure(failedCursor)        => ParseFailure(failedCursor)
        case ParseSuccess(matched, nextCursor) => f(matched)(nextCursor)
      }

  override def tailRecM[A, B](a: A)(f: A => Parse[Either[A, B]]): Parse[B] = {
    @tailrec
    def recur(cursor: InputCursor, parse: Parse[Either[A, B]]): ParseResult[B] = {
      parse(cursor) match {
        case ParseFailure(failedCursor) => ParseFailure(failedCursor)
        case ParseSuccess(matched, nextCursor) =>
          matched match {
            case Left(a)  => recur(nextCursor, f(a))
            case Right(b) => ParseSuccess(b, nextCursor)
          }
      }
    }
    cursor => recur(cursor, f(a))
  }

  override def pure[A](x: A): Parse[A] =
    cursor => ParseSuccess(x, cursor)
}

:::

Monad gives us the following capabilities:

  • We still get map() through a default definition.
  • pure() which lifts a result into a parse() function that will always be recognize the input without consuming from the cursor.
  • flatMap() which allows for sequential execution of one parse() function after another.
  • tailRecM() which provides monadic recursion. We'll be using this later.

Given this monad instance for the parse() function, we can now rewrite the & combinator:

def &[B](second: Parse[B]): Parse[(A, B)] =
  for {
    a <- this
    b <- second
  } yield (a, b)

This & combinator becomes very small. Again, like the term() function we no longer have to manually thread a cursor through each function: we simply declare how the individual parse() functions relate and modify what they collectively produce. Additionally the parse() function is now able to participate in the for comprehension syntax sugar as it defines both the map() and flatMap() operations.

Alternative parsing with the parse() function

The monad abstraction provides imperative execution. This means that we can't implement the | combinator using flatMap() as it requires both the first and second parse() functions to succeed in sequence. We need to lean on the Alternative typeclass to allow us to try one function but use another if it fails.

Let's start at the deciding factor of success or failure of the parse() function: ParseResult. We need to define a function that allows choice between a successful result and a failed one:

:::{.numberLines}

sealed trait ParseResult[A] {

  def orElse(other: => ParseResult[A]): ParseResult[A] =
    this match {
      case _: ParseFailure[_] => other
      case _: ParseSuccess[_] => this
    }
}

:::

If our result is ParseSuccess then orElse() returns the current result. If our result is ParseFailure then orElse() evaluates the other argument and returns that instead, regardless of success or failure. This allows us to try producing one result, use it if it's successful, and otherwise try producing the other result.

With this ability to abstract the selection of success or failure, we can now abstract the same ability for the parse() function by adding the Alternative typeclass trait to the extends clause of ParseInstances:

:::{.numberLines}

class ParseInstances extends Monad[Parse] with Alternative[Parse] {

  // monad declarations...

  override def empty[A]: Parse[A] =
    cursor => ParseFailure(cursor)

  override def combineK[A](x: Parse[A], y: Parse[A]): Parse[A] =
    cursor => x(cursor).orElse(y(cursor))
}

:::

Alternative gives us the following capabilities:

  • empty provides a parse() function that always fails, because it's empty.
  • combineK() allows for the combination of the results produced by the parse() function. Concretely, this means that the first function is attempted, and if it fails then the second function is attempted.

Armed with abstraction over selection of success or failure, we can redefine the | combinator:

:::{.numberLines}

import cats.syntax.semigroupk._

def |(second: Parse[A]): Parse[A] =
  this <+> second

:::

That's all the | combinator is now: this <+> second. By leveraging the Alternative abstraction, we are able to declaratively specify with a single operator that the first parse() function should be attempted before the second.

Putting it all together

Let's try defining a parse() function that's more complex than what we've covered so far: a generic programming language identifier recognizer.

This parse() function should be able to recognize the following inputs:

  • Any alphabetic letter followed by zero or more of alphabetic letters, digits, or underscores.
  • Any underscore as long as it is followed by one or more of alphabetic letters, digits, or underscores.

There's a lot going on in those two requirements. Let's break it down:

  • Recognize one of a range of possible matches from one parse() function.
  • Require zero or more matches from one parse() function.
  • Require at least one or more matches from one parse() function.

First, we need to define a parse() function that can recognize one match from any number of parsers. More specifically, we need two functions in order to do this:

:::{.numberLines}

def empty[A]: Parse[A] = Alternative[Parse].empty[A]

def oneOf[A](parsers: Seq[Parse[A]]): Parse[A] =
  parsers.foldLeft(empty[A])(_ | _)

def oneOf[A](parsers: Parse[A]*): Parse[A] =
  oneOf(parsers)

:::

Here we have a convenience empty[A] parser that leverages the Alternative typeclass we implemented earlier. By defining a oneOf() function that accepts a sequence of parse() functions, we can fold each function to the left starting with an empty parser and alternate each one. The first parse() function that matches will return a success result. If none match, then the final failure result will be returned instead.

How do we use this to accept any one of the alphabetic letters? It would be tedious to write a satisfy() function for each letter by hand, so instead let's specify the letters we want as a range and map over them:

:::{.numberLines}

def alpha: Parse[Char] =
  oneOf((('a' to 'z') ++ ('A' to 'Z')).map(c => satisfy(_ == c)))

:::

We can do the same thing with digits:

:::{.numberLines}

def digits: Parse[Char] =
  oneOf(('0' to '9').map(c => satisfy(_ == c)))

:::

We also need a parse() function that allows for recognizing zero or many matches and another for one or many:

:::{.numberLines}

def zeroOrMany[A](parse: Parse[A]): Parse[Seq[A]] =
  Monad[Parse].tailRecM(Seq[A]()) { seq =>
    parse.map(a => (seq :+ a).asLeft[Seq[A]]) |
      seq.asRight[Seq[A]].pure[Parse]
  }

def oneOrMany[A](parse: Parse[A]): Parse[Seq[A]] =
    (parse & zeroOrMany(parse)).map { case (first, rest) => first +: rest }

:::

Notice how our zeroOrMany() function leverages the monadic recursion capability from earlier.

Armed with these functions, we can now recognize identifiers!

:::{.numberLines}

def identifier: Parse[String] = {
  val underscore = satisfy(_ == '_')
  val rest = alpha | digit | underscore
  val alphaHead = alpha & zeroOrMany(rest)
  val underscoreHead = underscore & oneOrMany(rest)
    (alphaHead | underscoreHead).map { case (head, rest) => (head +: rest).mkString }
}

:::

But that seems almost too easy! We better write a bunch of tests to make sure this recognizes input as expected.

:::{.numberLines}

"identifier" can {
  "recognize a single alpha letter" in {
    identifier(InputCursor(0, "a banana")) shouldBe ParseSuccess("a", InputCursor(1, "a banana"))
    identifier(InputCursor(0, "y not")) shouldBe ParseSuccess("y", InputCursor(1, "y not"))
  }
  "recognize an alpha letter followed by alphas, digits, and underscores" in {
    identifier(InputCursor(0, "a_123_Word <-")) shouldBe
      ParseSuccess("a_123_Word", InputCursor(10, "a_123_Word <-"))
  }
  "reject a single underscore" in {
    identifier(InputCursor(0, "_ rejected")) shouldBe ParseFailure(InputCursor(1, "_ rejected"))
  }
  "recognize an underscore followed by alpha digits and underscores" in {
    identifier(InputCursor(0, "_a_123_Word <-")) shouldBe
      ParseSuccess("_a_123_Word", InputCursor(11, "_a_123_Word <-"))
  }
  "reject a digit" in {
    identifier(InputCursor(0, "1234")) shouldBe ParseFailure(InputCursor(0, "1234"))
  }
}

:::

What's next?

With parser combinators you are able to process text into any structure you need. In this post we only covered tuples, Seqs, and Strings, but parse functions can be written to produce full AST's or even interpret scripted programs.

Some subjects for next time:

  1. Parsing mathematical infix expressions and calculating their results.
  2. Error handling and recovery, an especially tricky subject!
  3. Stateful parsing, which requires lifting the parse() function into a higher-order monad.