Next: XML Support, Previous: Parser Buffers, Up: Input/Output
Although it is possible to write parsers using the parser-buffer abstraction (see Parser Buffers), it is tedious. The problem is that the abstraction isn't closely matched to the way that people think about syntactic structures. In this section, we introduce a higher-level mechanism that greatly simplifies the implementation of a parser.
The parser language described here allows the programmer to write BNF-like specifications that are translated into efficient Scheme code at compile time. The language is declarative, but it can be freely mixed with Scheme code; this allows the parsing of grammars that aren't conveniently described in the language.
The language also provides backtracking. For example, this expression matches any sequence of alphanumeric characters followed by a single alphabetic character:
(*matcher (seq (* (char-set char-set:alphanumeric)) (char-set char-set:alphabetic)))
The way that this works is that the matcher matches alphanumeric characters in the input stream until it finds a non-alphanumeric character. It then tries to match an alphabetic character, which of course fails. At this point, if it matched at least one alphanumeric character, it backtracks: the last matched alphanumeric is “unmatched”, and it again attempts to match an alphabetic character. The backtracking can be arbitrarily deep; the matcher will continue to back up until it finds a way to match the remainder of the expression.
So far, this sounds a lot like regular-expression matching (see Regular Expressions). However, there are some important differences.
Here is an example that shows off several of the features of the parser language. The example is a parser for XML start tags:
(*parser (with-pointer p (seq "<" parse-name parse-attribute-list (alt (match ">") (match "/>") (sexp (lambda (b) (error (string-append "Unterminated start tag at " (parser-buffer-position-string p)))))))))
This shows that the basic description of a start tag is very similar
to its BNF. Non-terminal symbols parse-name
and
parse-attribute-list
do most of the work, and the noise strings
"<"
and ">"
are the syntactic markers delimiting the
form. There are two alternate endings for start tags, and if the
parser doesn't find either of the endings, the Scheme code (wrapped in
sexp
) is run to signal an error. The error procedure
perror
takes a pointer p
, which it uses to indicate the
position in the input stream at which the error occurred. In this
case, that is the beginning of the start tag, i.e. the position of
the leading "<"
marker.
This example still looks pretty complicated, mostly due to the error-signalling code. In practice, this is abstracted into a macro, after which the expression is quite succinct:
(*parser (bracket "start tag" (seq (noise (string "<")) parse-name) (match (alt (string ">") (string "/>"))) parse-attribute-list))
The bracket
macro captures the pattern of a bracketed item, and
hides much of the detail.
The parser language actually consists of two languages: one for defining matchers, and one for defining parsers. The languages are intentionally very similar, and are meant to be used together. Each sub-language is described below in its own section.
The parser language is a run-time-loadable option; to use it, execute
(load-option '*parser)