Practical parsing with PEG and cpp-peglib

A very practical introduction to Parsing Expression Grammars (PEGs), in which we’ll build a non-trivial parser using the most excellent cpp-peglib single-include C++ library. The post includes links to PEG libraries for Go, Rust and Python, and a ready-to-run GitHub repository with all the examples.

If I ever stop programming, it will be because I can no longer find the energy to parse even just one more string format.

Way back in 2001 I wrote an introduction to Lex and Yacc which still gets an astounding 600 non-bot visitors per month. Lex and Yacc remain very powerful tools, but they don’t integrate very well with modern programming languages, and they take some work to set up these days. Yet for 23 years I could find nothing obviously better. And neither could a lot of other people.

This means that a lot of terrible parsing is going on, using stuff like regexes or artisanally crafted string slicers and dicers.

Recently during the development of simplomon I ran into Yuji Hirose’s lovely cpp-peglib, and finally found something that is so simple to use that I no longer have any excuse to write my own dubious quality parsers.

cpp-peglib lives up to the ancient Perl ideal that “easy things should be easy, hard things should be possible”. Unlike many other parser projects, I should note. Meanwhile, should you want to, cpp-peglib can even parse .docx files with ease.

The goal of this page is to rapidly get you up to speed with PEG. All the code examples can be found on GitHub. The examples are written in C++ and use cpp-peglib, but this page will likely be useful in general to learn about Parsing Expression Grammars. At the end of this post you’ll find links to PEG libraries for Go, Python and Rust, plus some interesting further reading.

Note that the cpp-peglib README isn’t bad, but it is better at exhaustively showcasing the capabilities than it is at getting you up to speed quickly. After reading the present page, the cpp-peglib README should be much more helpful.

Vector

As a trivial example, let’s say we need to parse a vector that looks like this: (0.123, -.987, 2.4, 3). This is so simple you might be tempted to remove the opening and closing parens, split the string on ‘,’, remove leading and trailing spaces, and convert the remaining strings into floats. Very doable. Maybe 15 lines of code with error checking. Once in a while you might even attempt a regular expression to see if you can get it to work this time.

Here’s how you’d parse this vector in cpp-peglib (full code is here):

#include "peglib.h"
...
peg::parser p(R"(

Vector      <- '(' Number ( ',' Number )* ')'
Number      <- [+-]?[0-9]*([.][0-9]*)?
%whitespace <- [\t ]*

)");

Using cpp-peglib is as easy as including “peglib.h”. There’s no need for any pre-processing, external tools or libraries.

The snippet above shows the PEG grammar of our vector format. This is different from what Lex/Yacc users are used to, and if you attempt “to do Yacc” in PEG, you are going to have a bad time, specifically with recursion (see below). Luckily most things are easier in PEG-land than in the world of LALR parsers.
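For example, the left-recursive rule style that is idiomatic in Yacc does not work in a PEG, since PEGs cannot handle left recursion; you write the repetition out directly instead:

# Left recursion, natural in Yacc, does not work in PEG:
Expr <- Expr '+' Term

# Instead, spell out the repetition:
Expr <- Term ('+' Term)*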

First line: Vector <- '(' Number ( ',' Number )* ')'

In the first line we define the end-product of our parsing efforts, a Vector. This consists of an opening parenthesis, a Number, and then zero or more ‘, Number’ repeats, followed by a closing paren.

Second line: Number <- [+-]?[0-9]*([.][0-9]*)?

The next line uses the dreaded regular expression syntax to define what a Number looks like. Using regular expressions for anything non-trivial is terrible and leads to multiple problems, but here the task is trivial enough for them to work. Later on this page we’ll find that you still need to be careful, though. Note that PEG regular expressions are greedy.

Third line: %whitespace <- [\t ]*

This final line is some very helpful cpp-peglib syntactic sugar that mops up any whitespace between numbers and commas.

Next up, let’s get some actual data out of this parser:

p["Number"] = [](const peg::SemanticValues &vs) {
  return vs.token_to_number<double>();
};

If you come from a C or older C++ background, the above may look very weird. Earlier I wrote a modern C++ tutorial for C programmers which might be useful.

Using this syntax, we can attach functions to anything we defined in the grammar. From the vs object we can get the token, which is the literal text that matched the Number regex. In this case we take that matching text, convert it to a double, and return it to cpp-peglib. So far so simple. Next up, the Vector, which as defined above contains one or more Numbers:

p["Vector"] = [](const peg::SemanticValues &vs) {
  vector<double> ret;
  for(const auto& v : vs)
    ret.push_back(any_cast<double>(v));
  return ret;
};

The vs object we got passed is also itself a vector<std::any>. And in those std::any’s we find the double values we returned for our Number instances.

A std::any can, with some overhead, contain an instance of almost any type. At runtime you can cast such a std::any back to what you put in there, and if you try the wrong type, an exception is raised.
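A minimal illustration:

std::any a = 3.14;                    // store a double
double d = std::any_cast<double>(a);  // fine, d is now 3.14
// std::any_cast<int>(a);             // would throw std::bad_any_cast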

The key thing here is that the peg::SemanticValues object contains:

  1. a view on the complete token (text) that matched this rule
  2. std::any objects that store what the functions attached to this rule’s sub-rules returned, in the order they appear in the rule.

In the code above we cast all the values in vs to double and put them in a vector<double> which we return.

Note, there is a shorthand that does this conversion to another vector in one go: return vs.transform<double>()
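With that shorthand, the whole Vector handler shrinks to:

p["Vector"] = [](const peg::SemanticValues &vs) {
  return vs.transform<double>();
};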

Finally, this code makes it all happen:

vector<double> result;
if(p.parse(argv[1], result)) {
  fmt::print("Parse result of '{}': {}\n",
           argv[1], result);
}

In the second line the vector<double> that we created earlier pops out again. We then use the most awesome {fmt} library to pretty print that vector:

$ ./hello "(0.123, -.987, 2.4, 3)"
Parse result of '(0.123, -.987, 2.4, 3)': [0.123, -0.987, 2.4, 3]

Now, even for this very trivial case, using a “real” parser did not take many more lines than dedicated hand-crafted code would have. And from here on it is only going to get better.
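As an aside, if the parse fails, p.parse() simply returns false. cpp-peglib can also tell you where and why via a logger callback; a minimal sketch, assuming the set_logger API from the cpp-peglib README (the exact signature may differ between versions, so check the README):

p.set_logger([](size_t line, size_t col, const string& msg) {
  cerr << line << ":" << col << ": " << msg << "\n";
});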

Escapes

Many parser demos lightly skip over practical things like actually getting data out of the parser again (which we covered above). They also frequently gloss over stuff like escaping. Yet escaping is a vital part of everything that uses delimiters, which is really everything.

Here’s a PEG grammar (full code) for a string format like "This is a \"test\" string":

QuotedString  <- '"' String '"'
String        <- (! '"' Char )*
Char          <- ('\\' . ) / 
                 (!'\\' .)

The first line simply says that our string is quoted. The second line is already more interesting - the unquoted string consists of zero or more Chars. The ! operator is exceptionally helpful: it looks ahead to the next character, and if that turns out to be a ", it leaves it in the buffer and ends the String. That " will then be matched by QuotedString.

The final line is where the magic happens. This line offers two choices, separated by the /. Unique to PEG is that this is not an ‘or’ - the alternatives are attempted in order, and the first one that matches in its entirety gets chosen. ‘/’ is called the ‘choice operator’. The first choice would match \". Because this choice consumes the ", it is never seen by QuotedString, so we don’t terminate the string early.

The second choice of the last line matches every character, as long as it isn’t a \.
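To see how ordered choice differs from a regular ‘or’, consider this rule:

# On input "ab", A matches only the 'a': the first alternative
# succeeds, so the second alternative is never tried.
A <- 'a' / 'ab'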

It might seem like we are done now, but we aren’t. Here’s how we get the data out of a Char:

p["Char"] = [](const peg::SemanticValues &vs) {
  string res = vs.token_to_string();

  if(vs.choice() == 0) { // this was an escape
    res = res.substr(1); // skip the "\"
  }
  return res;
};

Some mechanics here. Recall how the Char line had two choices separated by /. Also recall how the token in vs contains all the matched text. In this case, that means that if we encounter \" in a string, the token will also contain the \, which we don’t want to end up in our output string. That’s why we skip the first character (\) if this was an escape.

Note that we return each Char as a full string. This might seem wasteful, but might help us later if we also want to support escapes like \uA7FE. Which brings us to some further good news: cpp-peglib supports Unicode out of the box.

Next up, we need to gather these Char strings into an actual string:

p["String"] = [](const peg::SemanticValues &vs) {
  string ret;
  for(const auto& v : vs) {
    ret += any_cast<string>(v);
  }
  return ret;
};

Now, this manual skipping of the \ in the Char handler might get tedious. Luckily cpp-peglib supports a nice extension, which allows you to say:

Char <- ('\\' < . > ) / (!'\\' .)

Note how the . is now surrounded by a < and > pair. This delineates which part of the matching sequence you want to return as the token. This leads to the following p["Char"], which no longer really has to do anything:

p["Char"] = [](const peg::SemanticValues &vs) {
  return res = vs.token_to_string();
};

Now that we have a professional quoted string parser, we do have time for some further thinking. For example, what to do with this? "Hello,\nthis is on a new line". Do we want to turn \n into a newline? Or leave it as is? And where would you stop? The more exotic (and weird) octal escapes from C? Lots of stuff to ponder, and for many string formats, no one really knows. But if you know what you want, you can trivially add it to p["Char"].
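As a sketch of what that could look like (this is not in the example repo), here is a Char handler, built on the < . > grammar from above, that translates \n and \t:

p["Char"] = [](const peg::SemanticValues &vs) {
  string res = vs.token_to_string();
  if(vs.choice() == 0) {  // the escape branch: token is the char after '\'
    if(res == "n") return string("\n");
    if(res == "t") return string("\t");
  }
  return res;  // everything else passes through unchanged
};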

A worked example: parsing Prometheus data

To round this off, let’s create a parser that accepts Prometheus metrics and serves them up as an easily accessible C++ object. To make it somewhat useful, we’ll also convert this object into JSON.

Happily, the Prometheus people have actually documented their format rather well. Specifically:

metric_name [
  "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]

Where:

  • label_value can be any sequence of UTF-8 characters, but the backslash (\), double-quote ("), and line feed (\n) characters have to be escaped as \\, \", and \n, respectively
  • value is a float represented as required by Go’s ParseFloat() function. In addition to standard numerical values, NaN, +Inf, and -Inf are valid values representing not a number, positive infinity, and negative infinity, respectively.
  • The timestamp is an int64 (milliseconds since epoch, i.e. 1970-01-01 00:00:00 UTC, excluding leap seconds)

Then there are three kinds of comments, “HELP”, “TYPE” and “random”. Together it looks like this:

# Some random comment. Following comment lines are not so random:
# HELP apt_autoremove_pending Apt packages pending autoremoval.
# TYPE apt_autoremove_pending gauge
apt_autoremove_pending 73
# HELP apt_upgrades_held Apt packages pending updates but held back.
# TYPE apt_upgrades_held gauge
apt_upgrades_held{arch="",origin=""} 0 1712003163000

Here is the start of our grammar:

root        <- ( ( commentline / vline ) '\n')+
commentline <- ('# HELP ' name ' ' comment) /
               ('# TYPE ' name ' ' comment) /
               ('#' comment)
comment     <- (!'\n' .)*
name        <- [a-zA-Z0-9_]+ 

As root we declare that our file consists of comment lines and ‘value lines’. The comment lines in turn come in three kinds: ‘HELP’, ‘TYPE’ and ‘random comment’.

Here we already have an interesting choice to make. The Prometheus specification states that any ‘HELP’ and ‘TYPE’ lines must form a block together with the value lines of the same metric. However, the spec does not fully pin down the order of the lines within the block.

In what follows, I’m therefore going to disregard the ‘block’ concept, as it does not do a lot for us. The parser we’re about to make will accept a Prometheus file with its lines in any order, which is far more liberal than the spec. Depending on your mindset this could be good or bad. Specifically, you can’t use our grammar to validate a set of Prometheus metrics.

Next up, the actual lines with values:

vline <- (name ' ' value (' ' timestamp)?)  /
         (name labels ' ' value (' ' timestamp)?) 

Prometheus lines can optionally contain labels or a timestamp. This means that once a vline is processed, it could carry 2, 3 or 4 values, and a 3-value result could mean either a line with labels or a line with a timestamp. To disambiguate, we use the choice operator ‘/’: it tells us explicitly whether we got a line with labels or not (see the sketch below).

Here is a sample line with labels and a timestamp:

apt_upgrades_held{arch="",origin=""} 0 1712003163000
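To illustrate the disambiguation, here is a hedged sketch of a vline handler. The PromLine struct is hypothetical, and the sketch assumes the name, labels, value and timestamp rules return a string, a map, a double and an int64_t respectively; the real code lives in promparser.cc:

struct PromLine {   // hypothetical; promparser.cc structures its data differently
  string name;
  map<string, string> labels;
  double value = 0;
  optional<int64_t> timestamp;
};

p["vline"] = [](const peg::SemanticValues &vs) {
  PromLine ret;
  size_t pos = 0;
  ret.name = any_cast<string>(vs[pos++]);
  if(vs.choice() == 1)    // second alternative: labels were present
    ret.labels = any_cast<map<string, string>>(vs[pos++]);
  ret.value = any_cast<double>(vs[pos++]);
  if(pos < vs.size())     // an optional timestamp followed the value
    ret.timestamp = any_cast<int64_t>(vs[pos]);
  return ret;
};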

To parse the labels:

labels      <- '{' nvpair (',' nvpair)* '}' 
nvpair      <- name '=' '"' label_value '"' 
label_value <- (!'"' char)*
char        <- ('\\' . ) / 
               (!'\\' .)

This implements the escaping trick outlined above.

Finally there are two tricky lines:

value     <- '+Inf' / '-Inf' / 'NaN' / [0-9.+e-]+ 
timestamp <- [+-]?[0-9]*

The Prometheus specs explicitly state that positive and negative infinity are allowed values, as is NaN. We split this up into four choices, where the order is significant. The regular expression at the end is sloppy, and matches a lone ‘+’ for example. This means that “+Inf” would match partially, and then the parser would be confused by the unexpected “Inf” that follows. By ordering the choices judiciously we prevent this problem.

Now that the grammar is done, here are a few important parts of processing the data. First up, turning a value, including infinities, into a double:

p["value"] = [](const peg::SemanticValues &vs) {
  if(vs.choice() == 0 )  // +Inf
    return numeric_limits<double>::infinity();
  else if(vs.choice() == 1 )  // -Inf
    return -numeric_limits<double>::infinity();
  else if(vs.choice() == 2 )  // NaN
    return numeric_limits<double>::quiet_NaN();
    
  return vs.token_to_number<double>();
};

Next up, the escaping according to the Prometheus spec:

p["char"] = [](const peg::SemanticValues &vs) {
  string res = vs.token_to_string();
  if(vs.choice() == 0) { // this was an escape
    char c = res.at(1);
    if(c != '\\' && c != 'n' && c != '"')
       throw runtime_error(fmt::format("Unknown escape sequence '\\{}'", c));
    if(c == 'n')
      c = '\n';
    return c;
  }
  return res.at(0);
};

For the rest, I recommend that you look at all 200 richly commented lines of promparser.cc. That file also includes the code that enables useful error reporting.

To see how to turn this data structure into JSON, head to prom2json.cc. A Prometheus test file is included, which exercises the UTF-8 capabilities, as well as the escaping code, plus the +Inf, -Inf and NaN floating point numbers.

Finally the unit tests in promtests.cc are instructive.

If at this point you feel that you could write a Prometheus parser by hand without too much trouble, see if you can make one that passes these unit tests, and note how long it takes you.

A brief note on regular expressions

Some astute readers have already noted that the regular expressions meant to match numbers in the examples above will also match things that are definitely not numbers, like ‘+’, ‘.’ or even things like ‘eeee’. One could use better regular expressions, like [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?, in hopes of improving things. But such monstrosities hurt the eyes.

Instead, it is far easier to check in the attached parsing function if the input can be parsed as an actual floating point number. We don’t need to rely on the regex to protect us from that. In addition to the normal hook, cpp-peglib also offers parser["NUMBER"].predicate to implement such checks, and from there you can set an error message that should be helpful for the user.

What is important is that the regular expression does not match beyond the number string in the input text. But as long as you take care of that, it is not a problem if your regex isn’t a complete validator for the concept of a floating point number.
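A sketch of such a predicate for our earlier Number rule, assuming the predicate signature shown in the cpp-peglib README (return false to reject the match, with msg becoming the error message):

p["Number"].predicate = [](const peg::SemanticValues &vs,
                           const any& /*dt*/, string& msg) {
  string tok = vs.token_to_string();
  size_t pos = 0;
  try { stod(tok, &pos); } catch(const exception&) {}
  if(tok.empty() || pos != tok.size()) {
    msg = "'" + tok + "' is not a valid number";
    return false;  // reject the match; the parser reports msg
  }
  return true;
};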

Next steps

cpp-peglib is more capable than outlined above, and also comes with tooling and a live website to help you develop your grammars. It also features more advanced modes which allow you to parse more complicated languages, like HTML and XML. For lovers of parsing, it can also generate an AST.
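For example, AST generation takes only a few lines; a sketch based on the cpp-peglib README (AST mode replaces the per-rule actions we wrote above):

p.enable_ast();
shared_ptr<peg::Ast> ast;
if(p.parse("(1, 2, 3)", ast)) {
  ast = p.optimize_ast(ast);   // fold away trivial intermediate nodes
  cout << peg::ast_to_s(ast);
}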

cpp-peglib implements a large superset of the original PEG article, including enhancements that make life a lot easier and save a lot of typing.

Summarising

Parsing Expression Grammars are simple to write, and with the right tooling (like cpp-peglib) also simple to use. In under 200 lines we wrote a fully compliant parser of a non-trivial format, including escaping.

I sincerely hope that the above will help you write robust and secure parsers that outdo what you would otherwise hand-craft yourself.

Further reading and other languages