pacomb

Overview

PaComb is a parsing libraries that compiles grammars with semantic actions to combintators that can be used for parsing. Languages are defined with the Pacomb.Grammar module or preferably through our PPX extension. The library offers scanner less parsing, but the Pacomb.Lex module provides a notion of terminals and the Pacomb.Blank module allows to define function ignoring spaces and comments. This and other modules like Pacomb.Keywords and Pacomb.Word_list makes the scanner less approach very easy to use.

Pacomb is available via opam and from github

The main advantage of PaComb is to offer many features in one tool:

  1. BNF/EBNF like syntax to define grammars directly in your ML file using our ppx extension
  2. Parsing from left to right despite the use of combinators (hence the beginning of the input buffer can be collected by the GC).
  3. Support for ambiguous grammars with polynomial complexity using cache and merge (see Cache and merge for ambiguous grammars)
  4. Support for self extensible grammars via Dependent sequences
  5. Rejection of rule from action code using Pacomb.Lex.give_up
  6. Support for utf8 characters and graphemes (grapheme is the correct notion as it is atomic for unicode normalisation)
  7. Lexical convention (blank, comments, etc...) can change during parsing using Pacomb.Grammar.layout and the fact that we use scanner less parsing.
  8. Possibility to print the grammar (not perfect yet). This is useful to implement helps in your program using Pacomb.Grammar.print_grammar.

Importantly, performances of PaComb are very good: it is often less than two times slower than grammars generated by ocamlyacc. However, using specific Pacomb features like dependant sequences or cache will result in much slower grammar ... that you can not in general write with ocamlyacc anyway.

Pacomb modules

  1. Pacomb.Grammar The main module to build grammars
  2. Pacomb.Lex lexing and Pacomb.Lex.give_up
  3. Pacomb.Blank blanks and layout record
  4. Pacomb.Pos positions in file and handling of exceptions
  5. Pacomb.Keywords Some function to parse and reserve keywords
  6. Pacomb.Word_list An efficient extensible dictionnary to parse words
  7. Pacomb.Regexp Regexp implementation above Pacomb.Lex
  8. Pacomb.Input build input buffer from string, channel, ...
  9. Pacomb.Charset an efficient representation of character sets

PPX syntax extension

Defining languages using the Pacomb.Grammar module directly is cumbersome. For that reason, PaComb provides a BNF-like PPX syntax extension. An example with arithmetic expressions is given below. It can be compiled with command ocamlfind ocamlopt -package pacomb,pacomb.ppx -o calc -linkpkg calc.ml (if it is written to a file calc.ml). This example an many others are available in the examples folder of the source distribution of opam.

type p = Atom | Prod | Sum
let%parser rec expr p =
  Atom < Prod < Sum
  ; (p=Atom) (x::FLOAT)                        => x
  ; (p=Atom) '(' (e::expr Sum) ')'             => e
  ; (p=Prod) (x::expr Prod) '*' (y::expr Atom) => x *. y
  ; (p=Prod) (x::expr Prod) '/' (y::expr Atom) => x /. y
  ; (p=Sum ) (x::expr Sum ) '+' (y::expr Prod) => x +. y
  ; (p=Sum ) (x::expr Sum ) '-' (y::expr Prod) => x -. y

Let us explain a bit this example, it is a BNF using priorities from type p. The first rule Atom < Prod < Sum says that expressions at Atom level are included in expression at Prod level which are themselves included in expression at Sum level.

Action are given after the infix symbol => . Rules are separated by semicolumn and here, they all start with a guard indicated at which priority level the rule applies. The terminals in the grammar are (x::FLOAT), which is a predefined grammar for floating point numbers, binding the variable x for use in action code. and '(', ')', '*', ... which parse a single character (string are allowed too). The non terminals are (x::expr Prod), (y::expr Atom), ... which recursively call the same grammar at the given priority level. Parentheses are mandatory as we use OCaml priority for _ :: _.

To use the above parser, you need to define a toplevel rule, say which characteres to ignore and call the parser on a file or string. For instance the following will work and handle parse error without exiting the program:

(* blanks, i.e. characteres to be ignored *)
let blank = Blank.from_charset (Charset.singleton ' ')

let _ =
  try
    while true do
      let f () =
        Printf.printf "=> %!"; (* prompt *)
        Grammar.parse_string (expr Sum) blank (input_line stdin)
      in
      (* [Pos] module provides a function to handle exception with
         an optional argument to call for error (default is to exit with
         code 1 *)
      Pos.handle_exception ~error:(fun _ -> ()) f ()
    done
  with
    End_of_file -> ()

The main function above is Grammar.parse_string which takes the grammar to be parsed, the definition of blanks (sequence of characteres to be ignored) and what to parse (here a string).

The BNF above works using the ppx extensions [%%parser ...] for structures and [%parser ...] for expression. These are also accessible by suffixing keywords with %parser as in the above example. These ppx extensions extends ocaml expressions with a syntax for grammars of type 'a Grammar.t and modifies the behaviour of let-bindings especially recursive ones to use declare_grammar, set_grammar and grammar_family. Recall that due to the limitation of ppx, we use a sub-syntax of OCaml expressions for grammars. It is therefore not a good idea to use "=>" as an infix inside [%parser ...]. Moreover, as we are reusing ocaml syntax, if you fall outside of the BNF below, the expression is unchanged and compiled by OCaml. This means that a syntax error in the description of the BNF results in general in a (strange) type error at compilation.

We give below the BNF grammar for the extension, together with a sketch of its semantics.

expr    ::= ...                                             Any Ocaml expression
grammar ::= rule                                                          itself
       | grammar ; rule                                              Grammar.alt
rule ::= qitems => expr                                   A rule with its action
       | expr < ... < expr                              priority order see below
       | ERROR("m")                        report an error if parsing fails here
       | ERROR(ls)                           report errors if parsing fails here
qitems ::= ()                                                      Grammar.empty
       | non_empty_qitems                                                 itself
       | guard non_empty_qitems                                  conditional rule
guard ::= expr bool_op expr
       | not expr
       | expr =| epat                                           pattern matching
bool_op ::= = | < | > | <= | >= | <> | == | != | && | ||       boolean operators
non_empty_qitems ::= qitem                                                itself
       | non_empty_qitems qitems                                     Grammar.seq
qitem ::= item                                                            itself
       | (epat :: item)                        give a name if used in the action
       | ((epat,epat) >: item)                            as above, but for dseq
       | (lazy (epat,epat) >: item)                  as above, but for lazy dseq
       | (epat <: item)                                   dseq with no real deps
item ::= CHAR                                           Grammar.term(Lex.any ())
       | CHAR(c) or 'c'                                 Grammar.term(Lex.char c)
       | CHARSET(s)                                  Grammar.term(Lex.charset s)
       | STRING(s) or "s"                             Grammar.term(Lex.string s)
       | UTF8                                      Grammar.term(Lex.any_utf8 ())
       | UTF8(s)                                        Grammar.term(Lex.utf8 s)
       | GRAPHEME                              Grammar.term(Lex.any_grapheme ())
       | GRAPHEME(s)                                Grammar.term(Lex.grapheme s)
       | EOF                                            Grammar.term(Lex.eof ())
       | RE(expr)             Grammar.term(Lex.regexp (Regexp.from_string expr))
       | NAT                                            Grammar.term(Lex.nat ())
       | INT                                            Grammar.term(Lex.int ())
       | FLOAT                                        Grammar.term(Lex.float ())
       | STRING_LIT                              Grammar.term(Lex.string_lit ())
       | CHAR_LIT                                  Grammar.term(Lex.char_lit ())
       | RE(expr)             Grammar.term(Lex.regexp (Regexp.from_string expr))
       | ~? expr                                             Grammar.option expr
       | ~? [expr] expr                         Grammar.option_default expr expr
       | ~* expr                                               Grammar.star expr
       | ~* [expr] expr                               Grammar.star_sep expr expr
       | ~+ expr                                               Grammar.plus expr
       | ~+ [expr] expr                               Grammar.plus_sep expr expr
       | other expr                                                       itself
epat ::= lid
       | __                                                        encoding of _
       | (lazy epat)
       | (epat : coretype)
       | epat = lid                                     encoding of [pat as lid]
       | (epat, ..., epat)
       | uid(epat)
       | M.epat

epat correspond to an encoding of patterns in expressions. Beware that _ is invalid, use __ instead.

Guard using (e =| epat) allows for rule garder by pattern matching. This has also been tested with GADT (see examples/calc_ext2.ml).

Action code needs parenthesis or begin ... end if it uses if .. then, let expressions, pattern matching or sequences.

Anything which does not correspond to this grammar will be unchanged in the ocaml code (like the type definition in the example above). A mutually recursive definition can also mix the definition of grammars (parametric of not) with the definition of normal ocaml values. This means you could put the whole file inside %%parser ....

Beware that inside the scope of the extension, you can use the syntax for grammars everywhere. This allows for some nesting as in:

type p = Atom | Prod | Sum
let%parser rec
     expr p = Atom < Prod < Sum
            ; (p=Atom) (x::FLOAT)                             => x
            ; (p=Atom) '(' (x::expr Sum) ')'                  => x
            ; (p=Prod) (x::expr Prod) => ( '*' (y::expr Atom) => x*.y
                                         ; '/' (y::expr Atom) => x/.y)
            ; (p=Sum ) (x::expr Sum ) => ( '+' (y::expr Prod) => x+.y
                                         ; '-' (y::expr Prod) => x-.y)

Remark: the left factorisation in this example is useless: in will be automatically performed by Pacomb.

Here is the meaning of let bindings for grammars through the ppx extension:

For recursive grammar with exactly one parameter, a rule p_1 < p_2 < ... < p_n will automatically add rules to include the grammar parametrized by p_i in the grammar parametrized by p_(i+1). This was used by the calculator example above.

let%parser accepts the following attribute:

Controling evaluation of action and right recursion

Action are evaluated as soon as the rule is reduced. This may be a problem for grammar whose prefix are ambiguous, even if the grammar is not really ambiguous. This is notably true for right recursion (which as usual should be avoided). If nothing is done, right recursion will be quadratic in time, while linear in space and time is exptected (linear space is the reason to prefer left recursion). To delay evaluation of action, you should use lazy action. For instance, the right recursive grammar for sexpr below:

let%parser rec sexp =
  (x::RE id)         => Idt x
; '(' (l::sexps) ')' => Lst l
and sexps =
  ()                   => []
; (e::sexp) (l::sexps) => (e::l)

Does not work efficiently because in an expression like "a b c ..." all the lists [Idt "a"], [Idt "a"; Idt "b"], etc are constructed. It should be rewritten using lazy:

let%parser rec sexp =
  (x::RE id)              => Idt x
; '(' (lazy l::sexps) ')' => Lst l
and sexps =
  ()                        => lazy []
; (e::sexp) (lazy l::sexps) => lazy (e::l)

or left recursion:

let%parser rec sexp =
  (x::RE id)         => Idt x
; '(' (l::sexps) ')' => Lst (List.rev l)
and sexps =
  ()                   => []
; (l::sexps) (e::sexp) => e::l

Remark: there are some important optimisation for lazy so use the lazy keyword as above as much as possible and do not use Lazy.from_fun or Lazy.force if you can avoid it.

A similar problem arises if you want to evaluate an action immediatly after a newline has been parsed. As Pacomb is trying to read the blank characteres after the newline, the evaluation of the action is retarted.

Remark: right recursion for sequence not using any semantics is optimized and works in O(1) memory. For instance, the following to parse a list of command with dependant grammar (hence right recursion is mandatory) works:

let%parser rec cmds env =
    ()                                          => ()
  ; (top_expr env) '\n' (cmds env)              => ()
  ; ((env,()) >: new_rule env) '\n' (cmds env)  => ()
  ; ((env,()) >: rem_rule env) '\n' (cmds env)  => ()

It works for one main reason: we do not use any semantics for any item in the rule. The first member of the pair for dependant grammar is used soon enough and does not need a stack.

For instance, if you do not want to use input_line as above for your calculator, you may want to write:

let%parser rec exprs = () => ()
                     ; exprs (x::expr Sum) '\n' => print_float x

This will not work and printing will not occur just after reading the newline. A solution is to test for newline without reading it to trigger printing:

let nl _ b i _ _ = let (c,_,_) = Input.read b i in c = '\n'
let%parser rec top_expr =
  (t::Grammar.test_after nl expr) => Printf.printf "%g\n=> %!" t
let%parser rec exprs = () => () ; exprs top_expr '\n' => ()

Dependent sequences

Variable binding in the left part of a rule are not available before the final action code. Therefore, it can not be used for selecting grammar rule. For instance, (p::prio) (x::g p) => x will report p as unbounded. To solve this, you can use dependent sequences, using (x,y)>:item will allow x (but not y) to be used both in the action and the rule. The separation with dependent (the par that can be used in the rest of the rume) and non dependent part (used only in the action) is crucial for efficiency. Indeed, dependent grammar are compiled to combinators at runtime and memoised and you don't want "noise" in the memoisation. This means you should really use values in a set as small as possible in dependent parts. Here is an example of grammar using this to implement an extensible calculator allowing to add extra infix operators with a given priority (see tests/calc_ext.ml):

let%parser rec expr prio =
  ((pe1,e1)>:expr prio) ((pop,b)>:op pe1 prio) ((__,e2)::expr pop)
                             => (pop, b e1 e2)
; (x::FLOAT)                 => (0.0,x)
; '(' (e::expr max_prio) ')' => (0.0,e)

where op pe1 prio parse binary operator with priorities between pe1 and prio. The action returns both the expression and its priority level.

It is important to know that dependant or self extensible grammars remain efficient. The main loss comes from the fact that with dependant grammars, combinator are composed at tun time are OCaml's inliner can not optimise them.

In principle, you can write a grammar bnf to parse your favorite syntax for BNF and use a rule (g,__)>:bnf (x::parse_my_bnf g) => x to parse any BNF! The function parse_my_bnf should call function in the Pacomb.Grammar module to build the usable grammar and as dependent grammar are memoised, each bnf will be only compiled once. This really gives the power to write self extensible language easily. This is illustrated in the example tests/calc_ext2.ml which accept the following input:

2
(4)
rule "suc" 1 : Str "S" Exp 1 => Cst 1 Op2 "+"
rule "pre" 1 : Str "P" Exp 1 => Cst 1 Op2 "-"
SPPS6
rule "pow" 2 : Exp <2 Str "^" Exp  2 => Op2 "^"
rule "mul" 3 : Exp  3 Str "*" Exp <3 => Op2 "*"
rule "div" 3 : Exp  3 Str "/" Exp <3 => Op2 "/"
rule "add" 4 : Exp  4 Str "+" Exp <4 => Op2 "+"
rule "sub" 4 : Exp  4 Str "-" Exp <4 => Op2 "-"
3*3
2+2
(2+3)*5/2-1
SPS6 * 3 + S1
remove rule "suc"
remove rule "pre"
2^2^2
2^(3^2)

Note: one can fairly well control when action are evaluated and therefore the parsing may depends upon global references that are modified by your parser itsel. However, purely functional code should be prefered, even for extensible grammars, because if for instance you have unwanted ambiguities, you will have a behavior hard to predict and a bug hard to diagnose.

Dependant sequences are also useful to prevent construction of infinite grammar. For instance in this code from examples/calc_ext2.ml:

let%parser rec rule : type a. a ty -> (env -> a Grammar.t) Grammar.t
  = fun t ->
    "Exp" (prio<:FLOAT) (r::rule (Arr(Flt,t))) =>
      (fun env -> (x::expr env (get_prio prio env)) (f::r env) => f x)
  ; "Str" (s<:STRING_LIT) (r::rule t) =>
      (fun env -> (STR s) (x::r env) => x)
  ; "=>" (a::action t) => (fun _ -> () => a)

The construction for the grammar for rule would loop in the "Exp" rule if we were not using <:. epat<:item is equivalent to ((),epat)>:((x::item) => ((),x)). This represent a grammar depending from (), i.e. a constant grammar. The only effect if to delay the construction of the grammar right of <: until it is used for the first time.

Cache and merge for ambiguous grammars

Pacomb can manage ambiguous grammars, but this will result in general in very poor performances and issue a warning: "Parsing ambiguity, use cache with merge" on stderr. You can use Pacomb.Grammar.parse_all_buffer to remove the warning, but you should preferably use cache and merge.

Pacomb.Grammar.cache can be apply to a grammar to produce a cached grammar. This guaranty that this grammar will at most be called once for each position in the parsed buffer. This will not solve ambiguity, but may improve performance. For instance, there are some non ambiguous grammars that Pacomb can not left factorise (left factorisation is not performed to the left of a dependent sequence). The annotation @cache on let%parser will automatically wrap the grammar with a call to Pacomb.Grammar.cache.

When your grammar is ambiguous, you can give an optional merge function to Pacomb.Grammar.cache. This function will be called to merge all results conrresponding to exactly the same portion of the input buffer with the cached grammar.

Here is a toy example (see example/catalan.ml):

let%parser [@merge (+.)] rec bin_seq =
    ()                              => 1.0
  ; (t1::bin_seq) 'a' (t2::bin_seq) => t1 *. t2

This grammar will parse any sequence of 'a' of length n and return the number of binary tree with n nodes (i.e. catalan number)! It does run in polynomial time (O(N^3) in fact, which is not the best that can be done).

In general, using cache and merge, you should be able to reach O(N^3) time complexity and O(N^2) space complexity for any BNF grammar. This seems confirmed by our benchmarks.

Getting positions

In action code (expression right of =>), a lid_lpos or lid_rpos will denote respectively the left and right position of the item named lid. a lid_pos will group both lid_lpos and lid_rpos in a record of type Pos.t * Pos.t. If the item is matched by a tuple and you want to use its position you must use pat = lid syntax to give a name to the whole item. The variables _lpos, _rpos and _pos corresponds to the position of the input parser by the whole rule.

Here is an example parsing sexpr with positions:

type sexp = { p: Pos.t * Pos.t; e : sexp' }
and sexp' =
  | Idt of string
  | Lst of sexp list

let id = "[a-zA-Z_][a-zA-Z_0-9]*[']*"

let%parser rec sexp
   =    ; (x::RE id)     => { p = _pos; e = Idt x }
   ; '(' (l::sexps) ')'  => { p = _pos; e = Lst (List.rev l) })
and sexps = () => []
          ; (l::sexps) (e::sexp) => e::l

The type Pacomb.Pos.spos is a very light type to represent position. It contains a record of type Pacomb.Input.infos that is the same for the whole input and the position in bytes in the input of type Pacomb.Input.byte_pos. If you need line numbers and column number, the Pacomb.Pos module contains the necessary function to rescan the input and compute it (with a cache to avoid always rescanning from beginning).

This is important because in general positions are needed only in error messages and computing line and column number is costly, especially for utf8.

When parsing a stream that is not a regular file, the whole buffer is kept to allow rescanning. If your stream is very large, this is not reasonnable. Two solutions: restart parsing at regular interval, or use the ~rescan:false optionnal parameters and use only position in bytes.

Another problem with rescanning: if you use Grammar.parse_channel or Grammar.parse_fd, you can not get line and column number after closing the channel or file descriptor. However, if you use Grammar.parse_string or Grammar.parse_file you can. In this case, It is even possible to marshal (with closures) the type Pos.t. With Grammar.parse_file position will remain correct as long as the file is unchanged.

Producing nice error messages

The ERROR(m) syntax in the ppx extension allows to add well chosen error messages in your code. Here is a way to use it on the sexp grammar above:

let%parser rec sexp
   = ERROR(["id";"("])
   ; (x::RE id)         => { p = _pos; e = Idt x }
   ; '(' (l::sexps)     => (ERROR(")") | ')'
                        => { p = _pos; e = Lst (List.rev l) })
and sexps = () => []
          ; (l::sexps) (e::sexp) => e::l

Parsing "a b (c + ..." with the above grammar will give

Parse error: Line 1, character 7.
expecting: ) id (

Banchmarks seems to show that adding error cost more or less 15% on the above example.

Limitations

Pacomb must eliminate left recursion in grammars in order to use combinators that would loop otherwise. However, left recursion is not supported if it traverses A Pacomb.Grammar.layout constructor to change blanks (probably possible to solve this, but probably not worth it).

Ambiguous grammar will be analysed in polynomial time if and only if you use a cache for all ambiguous non terminals of your grammar. Pacomb does perform some left factorisation, but it is not complete. Using cache is also a solution to that problem. Pacomb has function to print the grammar to see if left factorisation was performed.

Note: left recursion do not need and is not eliminated if the grammar uses a cache. However, this solution to use cache in general is too slow for non ambiguous grammars so we do not impose a cache to all left recursive grammars.

The ppx extension is not too bad but still suffers from the fact that it uses a sub-language of OCaml to describe grammars. For instance

let%parser g =((_,x)::g) => x]

is not legal because _ cannot be used in an Ocaml expression. Though the following works:

let%parser g = ((__,x)::g) => x

The syntax

(((x,y) = z) :: g) => (x,y,z_pos)

is not very nice as we use = to replace the as keyword and we also need a lot of parentheses.

A solution would be to rewrite OCaml's parser with pacomb and extend the language with real grammar extension!