A Perl Grammar Tutorial ( Beginner’s Guide )

This tutorial became part of the official Perl 6 documentation as Grammar Tutorial where it became a “living” document which can be edited and expanded upon by others. As such, this document may be out of date, for better or for worse.

Why Grammars?

You will always run into string parsing that will give you a headache. It it said that HTML, for example, cannot be broken down and parsed effectively at all, simply by using regular expressions to sort out the elements. Another example would be defining the order in which words and symbols might make up a language and provide meaning. This is when Perl’s grammar system fits perfectly.

When Would I Use Grammars?

Grammars are great at taking stings, trying to make sense of them, and then saving them into a data structure you can actually work with. If you have strings of some kind that you need to bring order to, or interpret, grammars give you some great tools to make it easier.

Your string might be a whole file that you need to break into sections. Or perhaps line by line. Maybe you have a protocol like SMTP you’re working with, and want a convenient and organized way to define which “commands” need to come after what data from the user to make the protocol work. Maybe you’re wanting to create your own string-based protocol. Maybe you’re designing your own language.

The Broad Concept of Perl Grammars

Regular Expressions (regex) work well to find patterns in strings and manipulate them. However, when you need to find multiple patterns at once, or need to combine patterns, or test for patterns that may surround strings, or other patterns – regular expressions alone are not adequate.

Grammars provide a way to define how you want to examine a string using regular expressions, and you can group these regular expressions together to bring even more meaning.

For example, in the case of HTML, you could define a grammar that would recognize HTML tags, both the opening and closing elements, and the text in between, and act upon these elements as a whole by stuffing them into data structures, such as arrays or hashes, that you can then easily use. In essence, Grammars provide a means to define an entire language or specification that can be used to parse strings of arbitrary sizes and complexity.

Getting More Technically into Grammars

The conceptual overview

Grammars are defined as an object, just like everything else in Perl. Technically, they are normal classes with a little extra magic thrown in, which we’ll get to later — and a few limitations. You name and define a grammar exactly as you would a class, except using the “grammar” keyword instead of “class”.

grammar My::Gram { ..methods 'n stuff... }

Grammars contain elements like methods, called regex, token or rule. These elements are named, just like methods are named. And each one defines a regex, token or rule (which are mostly the same thing (not really) but we’ll get to that later).

Once you have your grammar defined, in your program call it by name and pass in the string you want to parse, and that string will be run through the rules you defined based upon your regex, token and rule “methods”.  When done, you get back a Match object that has been populated with data structured and stored by the names you used to define your methods.

my $matchObject = My::Gram.parse($what-a-big-string-you-have);

Now, you may be wondering, if I have all these regexes defined that just return their results, how does that help with parsing things that may be forwards or backwards in a string, or things that need to be combined from multiple of those regexes… and that’s where grammar actions come into play.

For every “method” you match in your grammar, you get an action you can call to do something funny or clever with that match. You also get an over-arching action you can use to tie them all together and custom build a data structure you might want to return, where all your crazy string parsing makes sense in your nicely ordered and defined data structure. This over-arching method is called TOP by default. We’ll get more into this as well.

The technical overview

Grammars are defined just like a class, but using the grammar keyword in place of class. The “methods” in grammars are called either regex, token or rule. Regex methods are slow but thorough — they will look back in the string and really try. Token methods are much faster and they ignore whitespace. Rule methods are the same as token methods except that they pay attention and consume whitespace in your “regex” definitions.

When a method (regex, token or rule) matches in the grammar, that matched string is put into the Match object that will eventually be returned, and it will be keyed with the same name as the method you chose to name it.

grammar My::Gram {
    token TOP { <thingy> .* }
    token thingy { 'clever_text_keyword' }
}

So in this, if you were to my $match = My::Gram.parse( $string ) — and your string started with the characters ‘clever_text_keyword’, you would get a match object back that contained ‘clever_text_keyword’ keyed by the name of ‘thingy’ in your match object. These can grow much more complex, as your needs require, as you might imagine.

Now, to mention TOP. The TOP method (regex, token or rule) is the overarching regex that must match everything (by default). If the string you pass in to parse doesn’t match the TOP regex, your returned match object will be empty (Any).

As you can see above, in TOP, the “<thingy>” token is mentioned. The <thingy> is defined on the next line, “token thingy…”. That means that ‘clever_text_keyword’ must be the first thing in the string passed in, or the grammar parse will fail, and we’ll get an empty match. This is great for recognizing malformed stuff that someone might give you that should be thrown away.

Learning By Example – a REST Contrivance

Let’s suppose we’d like to parse a URL into the component parts that make up a RESTful request. Let’s decide that we want the URL’s to work like this:

  1. The first part of the URI we’ll call the “subject”, like a part, or a product, or a person.
  2. The second part of the URI we’ll call the “command”, like standard CRUD stuff (create, retrieve, update, or delete).
  3. The third part of the URI will be arbitrary data. Perhaps the specific ID we’ll be working with, or a long list of data separated by “/”‘s.
  4. When we get a URL, we’ll want 1-3 above to be placed into a nice data structure we can use without having to do all sorts of splitting, and that can be easily altered in the future or expanded upon (or extended).

So if we got a URI on the server of “/product/update/7/notify” we would want our grammar to give us a nice $match object that has a “subject” of “product”, a “command” of “update” and “data” of “7/notify” (for now).

The first thing we do is define the grammar class. We’re going to need to define our subject, command and data as well. I think we’ll use token for them, since we don’t care about whitespace in the regex.

grammar REST {
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }
}

So far this REST grammar says we want a subject that will be just word characters, a command that will be just word characters, and data that will be everything else left in the string (URI in this case).

But in our big string we get, we don’t know what order these regex matches will come in. We need to be able to place these matching tokens in the larger context of our URI we’ll be passing in as that string. That’s what the TOP method is for. So we add it, and place our tokens by name within it, along with however else our valid string should look, coming in.

grammar REST {
    token TOP     { '/' <subject> '/' <command> '/' <data> }
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }
}

You could actually use this to extract your data from the URI for basic CRUD that has all 3 parameters included:

my $match = REST.parse('/product/update/7/notify');
say $match;

「/product/update/7/notify」
 subject => 「product」
 command => 「update」
 data => 「7/notify」

Of course, the data can be accessed directly by using $match<subject> or $match<command> or $match<data> to return the values parsed. They each contain match objects you can work further with, or coerce into a string ( $match<command>.Str )

Adding some flexibility

The REST grammar so far will handle retrieves, deletes and updates ok. However, a create command doesn’t have the third part (the data portion). This means our grammar will fail to match if we try to parse an update URL, and everyone will scream. To avoid this, we need to make that last data position match optional, along with the ‘/’ preceding it. This is easily accomplished by adding two question marks to the TOP token, to indicate their optional nature, just like a normal regex. So now we have:

grammar REST {
    token TOP     { '/' <subject> '/' <command> '/'? <data>? }
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }
}

my $m = REST.parse('/product/create');
say $m<subject>, $m<command>;

# 「product」「create」

Let’s imagine, for the sake of demonstration, that we might want to allow these same URI’s to be entered in by a user from the terminal. In that case, they might put spaces between the ‘/’s, since users are prone to break things. If we wanted to accommodate this possibility, we could replace the ‘/’s in TOP with another token that allowed for spaces on either side of it.

grammar REST {
    token TOP     { <slash><subject><slash><command><slash>?<data>? }
    token subject { \w+ }
    token command { \w+ }
    token data    { .* }

    token slash   { \s* '/' \s* }
}

my $m = REST.parse('/ product / update /7 /notify');
say $m;

###
「/ product / update /7 /notify」
 slash => 「/ 」
 subject => 「product」
 slash => 「 / 」
 command => 「update」
 slash => 「 /」
 data => 「7 /notify」

We’re getting some extra junk in our match object now, with those slashes, but there are some very nice ways to make a tidy return value that we’ll get to.

Adding some constraints

We want our RESTful grammar to allow for CRUD operations only. Anything else we want to fail to parse. That means our “command” above should have one of four values: create, retrieve, update or delete.

There are several ways to accomplish this. For example, you could change the command method:

token command { \w+ }
    
  ..becomes..

token command { 'create'|'retrieve'|'update'|'delete' }

This results in any URI coming in getting checked; where the second string between ‘/’s must be one of those values, or the parsing of the URI fails. Exactly what we’d like.

There is another way, though, that can give you far greater flexibility to do interesting or unholy things with your regexes, and provide some better readability when options grow large. These are proto-regexes.

To utilize these multimethods (here called protoregexes) to constrain our command to the same values we had above, we’ll replace “token command” with the following:

proto token command {*}
token command:sym<create>   { <sym> }
token command:sym<retrieve> { <sym> }
token command:sym<update>   { <sym> }
token command:sym<delete>   { <sym> }

Using protoregexes like this gives us some greater flexibility. For example, instead of returning <sym>, which is the string that was matched, we could instead enter our own string, or do other funny stuff. We could do the same with the “token subject” method, and limit it also to only parsing correctly on valid subjects (like ‘part’ or ‘people’, etc.).

Putting our RESTful grammar together so far

This is what we’ve come up for processing our RESTful URIs so far:

grammar REST
{
    token TOP { <slash><subject><slash><command><slash>?<data>? }
    
    proto token command {*}
    token command:sym<create>   { <sym> }
    token command:sym<retrieve> { <sym> }
    token command:sym<update>   { <sym> }
    token command:sym<delete>   { <sym> }
    
    token subject { \w+ }
    token data    { .* }
    token slash   { \s* '/' \s* }
}

Let’s look at various URI’s and how they behave being passed through our grammar.

my @uris = ['/product/update/7/notify',
            '/product/create',
            '/item/delete/4'];

for @uris -> $uri {
    my $m = REST.parse($uri);
    say "Sub: $m<subject> Cmd: $m<command> Dat: $m<data>";
}

###
Sub: product Cmd: update Dat: 7/notify
Sub: product Cmd: create Dat: 
Sub: item Cmd: delete Dat: 4

So with just this part of a grammar, we’re getting almost everything we need. Our URI’s get efficiently parsed and we’re given a nice little data structure for the variables we need to work with.

But look at that first line returned — the data token is returning the entire end of the URI as just one string. We need to be able to work with that 7 there. And that 4! Well, the 4 is easy… But the 7 had the extra /notify on the end, to signal the system to notify someone that a product was updated (perhaps).

So let’s make sure we can do stuff with our regex tokens that were matched, such as that data token that returned a “7/notify”. And to do so, we’ll take advantage of another characteristic of these Grammar classes — a thing called actions.

Grammar Actions

We’re going to diverge from our example for a moment to talk about Perl’s grammar actions. We’re going to do this because, in many ways, grammar actions are separate from the grammars you define. They are a completely different class that you create, and use from your grammars to do stuff with the matches you find in your grammars.

You can think of grammar actions as a kind of plug-in expansion module for grammars. A lot of the time you’ll be happy using grammars just on their own. But when you need to process some of those strings further, you can plug in the expansion module.

You do this when you first create your grammar. In addition to passing in the actual string you want to parse, you can pass in a named parameter called “actions” which should contain an instance of your actions class. From our example above, if our actions class were called REST-actions we would parse our grammar string like this

my $matchObj = REST.parse($uri, actions => REST-actions.new);

   ...or if you prefer...

my $matchObj = REST.parse($uri, :actions(REST-actions.new))

Grammar actions get pulled into your grammar in this way. And once there, your grammar can utilize the actions you define. And those defined actions are defined as normal methods in your action class.

The only weird bit is that if you name your action methods with the same name as your grammar methods (tokens, regexes, rules), then when your grammar methods match, your action method with the same name will get called automatically for you, and it will be passed the match object from that specific grammar token that matched.

In other words, when you’ve attached an action class, name the methods in that class with the same names you used in your grammar class, if you want actions to be called automatically when grammar regexes match.

Matching actions will get passed the Match object from the grammar token as its argument, and this can be represented by the $/ variable. This means that all your action methods should take $/ as the argument, then you get to work with that grammar match object in your action. So much for words, let’s return the example scenario to try and find more solidified clarity.

Grammars by Example with Actions

Here we are back to our grammar, as we left it.

grammar REST
{
    token TOP { <slash><subject><slash><command><slash>?<data>? }
    
    proto token command {*}
    token command:sym<create>   { <sym> }
    token command:sym<retrieve> { <sym> }
    token command:sym<update>   { <sym> }
    token command:sym<delete>   { <sym> }
    
    token subject { \w+ }
    token data    { .* }
    token slash   { \s* '/' \s* }
}

Now we want to make sure that our data token gets parsed a little bit further, so that we can separate out an ID if we get it, from the rest of the long URL that might follow, such as “7/notify” in our example.

To accomplish this we’ll create an action class, and in it, create a method with the same name as the named token, rule or regex we want to process. In this case, our token is named “data”.

class REST-actions
{
    method data($/) { $/.split('/') }
}

Now when we pass the URL string through our grammar, the “data” token match will be passed to the action class (REST-actions) to the method “data”, and we’ll split that URI string by its ‘/’ character. That way, the first element of the returned list will be our ID number (7 in the case of “7/notify”).

But not really.

Keeping grammars with actions tidy with “make” and “made”

If our grammar calls our action above on data, the data method will be called, but nothing will show up in the big TOP grammar match result returned to our program. In order to make our action results show up, we need to call “make” on that result, and that result can be many things, including strings, array or hash structures.

You can imagine that the “make” we put on our action results, places that result in a special, contained area in our whole grammar. Everything that we “make” for data structures, can be accessed later by “made”.

So instead of our REST-actions class above, we should write

class REST-actions
{
    method data($/) { make $/.split('/') }
}

When we add “make” to our match split (which returns a list), our action will return a nice data structure to our grammar that will be stored separately from the “data” token of our original grammar. This way, we can still work with both if we need to.

Now if we want to access just our id of 7 from that long URL, we can access the first element of the list returned from the “data” action we “made” with “make”:

my $uri = '/product/update/7/notify';

my $match = REST.parse($uri, actions => REST-actions.new);

say $match<data>.made[0];  # 7
say $match<command>.Str;   # update

Here, we call “made” on our data, because we want the result of our action that we “made” (with “make”), to get our split array. That’s lovely! But, wouldn’t it be lovelier if we could “make” a friendlier data structure that contained all of the stuff we want, rather than having to coerce types and remember arrays?

Well, just like TOP in our grammar that over-arches and matches the entire string, our actions have a TOP method as well. We can “make” all of our individual match components, like “data” or “subject” or “command” and then we can place them in our perfect fantasy data structure which we will “make” in TOP. When we return our final match object from our grammar processing, we can then access this perfectly-imagined data structure that we’ve cobbled together, to make everyone’s life happy and easier.

To do this, all we have to do is add the method TOP to our action class, and in that method, “make” whatever data structure we like from our component pieces. So, our action class now might become

class REST-actions
{
    method TOP ($/) {
        make { subject => $<subject>.Str,
               command => $<command><sym>.Str,
               data    => $<data>.made }
    }

    method data($/) { make $/.split('/') }
}

Here in our TOP method, our “subject” remains the same as the subject we matched in our grammar. Also, our “command” returns the valid <sym> that was matched (create, update, retrieve, or delete). Each we coerce into .Str as well, since we don’t need the full match object.

But what we want to be certain to do, is to use the “made” method on our $<data> object, since we want to access that split one that we “made” with “make” in our action, rather than the proper $<data> object.

After we “make” something in the TOP method of a grammar action, we can then access all of our custom-made stuff by calling the “made” method on our grammar result object. From our earlier example, it now becomes

my $uri = '/product/update/7/notify';

my $match = REST.parse($uri, actions => REST-actions.new);

my $rest = $match.made;
say $rest<data>[0];   # 7
say $rest<command>;   # update
say $rest<subject>;   # product

Of course you could also shorten it if you knew you wouldn’t need the full grammar return match object, and only wanted your custom-made data from your action’s TOP.

my $uri = '/product/update/7/notify';

my $rest = REST.parse($uri, actions => REST-actions.new).made;

say $rest<data>[0];   # 7
say $rest<command>;   # update
say $rest<subject>;   # product

Oh, did we forget to get rid of that ugly array element number? Hmm. Let’s make a new thing in our grammar’s custom return in TOP… how about we call it “subject-id” and have it be set to element 0 of <data>.

class REST-actions
{
    method TOP ($/) {
        make { subject    => $<subject>.Str,
               command    => $<command><sym>.Str,
               data       => $<data>.made,
               subject-id => $<data>.made[0] }
    }

    method data($/) { make $/.split('/') }
}

Now we can do this instead

my $uri = '/product/update/7/notify';

my $rest = REST.parse($uri, actions => REST-actions.new).made;

say $rest<command>;    # update
say $rest<subject>;    # product
say $rest<subject-id>; # 7

And this is the grammar and grammar actions that got us there, to recap:

grammar REST
{
    token TOP { <slash><subject><slash><command><slash>?<data>? }
    
    proto token command {*}
    token command:sym<create>   { <sym> }
    token command:sym<retrieve> { <sym> }
    token command:sym<update>   { <sym> }
    token command:sym<delete>   { <sym> }
    
    token subject { \w+ }
    token data    { .* }
    token slash   { \s* '/' \s* }
}
class REST-actions
{
    method TOP ($/) {
        make { subject    => $<subject>.Str,
               command    => $<command><sym>.Str,
               data       => $<data>.made,
               subject-id => $<data>.made[0] }
    }

    method data($/) { make $/.split('/') }
}

Hopefully this has helped introduce you to the concept of grammars in Perl and how grammars and grammar action classes can tie together. For further information, insights and oddities, check out the more advanced Perl grammar guide.

Leave a Reply

Your email address will not be published. Required fields are marked *