The standard C library environment provides various routines that will format a time from the internal binary representation to a textual representation. What it lacks, though, is a routine that does the opposite: Parsing a textual date and time representation into a binary representation. This is exactly what the parser provided in this example does.
The rfcdate_parser
class understands all date and time specifications
described in section 5 of RFC 822. This does not
include many other popular syntaxes, such as the ISO format, the ASN.1 format, etc., but using
this class as an example, it should be trivial to write appropriate parsers for these formats
as well.
Aside from being outright useful, the rfcdate_parser
class is intended to
serve as a (more complex) example of how to use the Spirit parser framework. Thus, this
document has been written more as a tutorial than as reference documentation, in the hope
that it will help new users understand how to apply Spirit to similar problems.
The RFC format for date and time specifications was originally defined in RFC 822 and has since been
re-used in many RFC formats and protocols. The exact specification in the RFC's augmented
BNF
is as follows:
    date-time = [ day "," ] date time
    day       = "Mon" | "Tue" | "Wed" | "Thu" | "Fri" | "Sat" | "Sun"
    date      = 1*2DIGIT month 2DIGIT
    month     = "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun"
              | "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec"
    time      = hour zone
    hour      = 2DIGIT ":" 2DIGIT [":" 2DIGIT]
    zone      = "UT" | "GMT" | "EST" | "EDT" | "CST" | "CDT"
              | "MST" | "MDT" | "PST" | "PDT" | 1ALPHA
              | ( ("+" | "-") 4DIGIT )
The syntax actually understood by the rfcdate_parser
class varies from this
grammar in three points:
The time rule is optional. If omitted, 00:00 is assumed.
Within the time rule, the zone rule is optional. If omitted, UTC is assumed.
Concerning the specification of the date's year: a two-digit year XY is interpreted as 19XY; everything else is taken literally. Hence, the parser will understand a date such as 1 Jan 1312, even though your system is probably not able to handle that date correctly, because it cannot be expressed as a time_t (seconds since 1 Jan 1970). So be careful to check for errors when dealing with such dates.
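The year rule and the error check can be sketched in plain C++ as follows. This is a minimal illustration, not the code of the rfcdate_parser class: the helper names interpret_year and to_time_t are hypothetical, and the sketch assumes that "two-digit" refers to the number of digits in the input text. Note that std::mktime interprets its argument as local time, so the result is only meant to show how a failed conversion is detected.

```cpp
#include <cstdlib>
#include <ctime>
#include <string>

// Hypothetical helper: maps the digits of a parsed year to a calendar year.
// Exactly two digits "XY" become 19XY; anything else is taken literally.
int interpret_year(const std::string& digits)
{
    int year = std::atoi(digits.c_str());
    return digits.size() == 2 ? 1900 + year : year;
}

// Hypothetical helper: converts a calendar date (midnight, local time) to a
// time_t. Returns false if the date cannot be represented on this system.
bool to_time_t(int year, int month, int day, std::time_t& result)
{
    std::tm tm = std::tm();      // zero-initialize all fields
    tm.tm_year  = year - 1900;   // struct tm counts years from 1900
    tm.tm_mon   = month - 1;     // struct tm counts months from 0
    tm.tm_mday  = day;
    tm.tm_isdst = -1;            // let the library decide about DST
    result = std::mktime(&tm);
    return result != static_cast<std::time_t>(-1);
}
```

Whether a date such as 1 Jan 1312 converts successfully depends on the platform's time_t, which is why the return value should always be checked.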
At first sight, this doesn't look too unreasonable. Unfortunately, a few sections earlier, the RFC states that any atom may be delimited by white space (space or tab), continued linear white space (carriage return + newline + white space), or comments (pretty much anything in parentheses). Furthermore, comments may nest, any character may be escaped, and so on and so forth. In effect, this means that the rather sane input
12 Jun 82
is identical to the rather insane input:
12 (\(( This is a nested comment\), still), and still) Jun (hehe) 82
Of course, it is almost impossible to specify an EBNF that parses such a thing -- which is exactly why the RFC does not, and why most parsers do not, in fact, handle it.
Using Spirit, though, parsing this beast is astonishingly easy; you just have to split the functionality into an actual parser and a skipper. If you want to find out how, read on ...
The rfc_skipper class
The most complicated part of parsing anything that is based on the grammar defined in RFC 822 is the crazy comment and line continuation syntax. Once you have that out of the way, the rest is rather simple. Fortunately, Spirit provides a great mechanism that solves this problem altogether for us: the skipper. A skipper is basically a parser that will be applied every time a token of the actual grammar has matched. If the skipper matches the input following the token, all matching characters will be skipped. That is, the real parser will not see them.
Thus, if you want to parse a sequence of numbers separated by blanks, like this:
input = number ( " " number )*
you can either write the parser accordingly, expecting those blanks, or you can say
input = number ( number )*
and combine it with a skipper that will match a blank, such as
spirit::space_p
. (By the way: If you want to disable the skipper in certain parts
of the grammar, which have to be parsed literally, you can wrap them in a
spirit::lexeme_d
directive.)
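The division of labour between parser and skipper can be illustrated without Spirit. The following is a hand-rolled sketch of the idea, not Spirit's implementation; the names skip_blanks, parse_number, and parse_numbers are hypothetical. The token parser knows nothing about blanks, and the skipper is run around every token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Skipper: advances past blanks and returns the new position.
std::size_t skip_blanks(const std::string& s, std::size_t i)
{
    while (i < s.size() && s[i] == ' ') ++i;
    return i;
}

// Token parser for "number": consumes digits only. The skipper is never run
// inside a token, so "12" stays one number.
bool parse_number(const std::string& s, std::size_t& i, unsigned& value)
{
    std::size_t start = i;
    value = 0;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
        value = value * 10 + (s[i++] - '0');
    return i != start;   // at least one digit must have been consumed
}

// input = number ( number )*, with the skipper applied around every token.
template <typename Skipper>
bool parse_numbers(const std::string& s, Skipper skip, std::vector<unsigned>& out)
{
    std::size_t i = skip(s, 0);
    unsigned n;
    if (!parse_number(s, i, n)) return false;
    out.push_back(n);
    for (i = skip(s, i); i < s.size(); i = skip(s, i)) {
        if (!parse_number(s, i, n)) return false;
        out.push_back(n);
    }
    return true;
}
```

Because the skipper is a parameter, the same number grammar could be combined with a different skipper, which is the property the rest of this section relies on.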
Thus, once we have a skipper that skips all that comment-junk for us, parsing the actual contents will be much easier. Here is the code:
    struct rfc_skipper : public spirit::grammar<rfc_skipper>
    {
        rfc_skipper() { }

        template<typename scannerT>
        struct definition
        {
            definition(const rfc_skipper& self)
            {
                using namespace spirit;
                first =
                    (
                        // Anything that is not a token: white space or a comment.
                        junk    = lwsp | comment,
                        // White space, optionally preceded by a line break.
                        lwsp    = +( !str_p("\r\n") >> chset_p(" \t") ),
                        // Comments are parenthesized and may nest.
                        comment = ch_p('(') >> *( lwsp | ctext | qpair | comment ) >> ')',
                        ctext   = anychar_p - chset_p("()\\\r"),
                        // A backslash escapes the character that follows it.
                        qpair   = ch_p('\\') >> anychar_p
                    );
            }

            const spirit::rule<scannerT>& start() const { return first; }

            spirit::subrule<0> junk;
            spirit::subrule<1> lwsp;
            spirit::subrule<2> comment;
            spirit::subrule<3> ctext;
            spirit::subrule<4> qpair;
            spirit::rule<scannerT> first;
        };
    };

    const rfc_skipper rfc_skipper_p;
As you can see, the skipper will match anything that is not an actual token according to the RFC:
The term linear white space is somewhat misleading. The RFC defines this as an end of line (\r\n) followed by at least one white space character (space or \t). This is also known as a continued line.
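To make these rules concrete, here is a hand-written sketch in plain C++ that removes the same kind of junk from a string. It is not part of the example's source and does not use Spirit; the function names are hypothetical, and as a simplification every character inside a comment other than '(', ')' and a backslash is accepted as comment text.

```cpp
#include <cstddef>
#include <string>

// Skips one comment starting at s[i] == '(' and returns the index just past
// the matching ')'. Comments may nest, and a backslash escapes the next char.
std::size_t skip_comment(const std::string& s, std::size_t i)
{
    int depth = 0;
    while (i < s.size()) {
        char c = s[i];
        if (c == '\\' && i + 1 < s.size()) { i += 2; continue; }  // quoted pair
        if (c == '(') ++depth;
        else if (c == ')' && --depth == 0) return i + 1;
        ++i;
    }
    return i;  // unterminated comment: consume the rest of the input
}

// Skips white space, continued lines (CRLF followed by space or tab), and
// comments, starting at index i. Returns the index of the next token char.
std::size_t skip_junk(const std::string& s, std::size_t i)
{
    for (;;) {
        if (i < s.size() && (s[i] == ' ' || s[i] == '\t'))
            ++i;
        else if (i + 2 < s.size() && s[i] == '\r' && s[i + 1] == '\n'
                 && (s[i + 2] == ' ' || s[i + 2] == '\t'))
            i += 3;
        else if (i < s.size() && s[i] == '(')
            i = skip_comment(s, i);
        else
            return i;
    }
}

// Returns the input with all junk removed, keeping only token characters.
std::string strip_junk(const std::string& s)
{
    std::string out;
    std::size_t i = skip_junk(s, 0);
    while (i < s.size()) {
        out += s[i++];
        i = skip_junk(s, i);
    }
    return out;
}
```

Applied to the insane date shown earlier, strip_junk leaves only the characters 12Jun82; the Spirit skipper achieves the same effect declaratively, and the real parser additionally sees the token boundaries.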
Having a generic RFC-style skipper available is -- by the way -- useful for much more than parsing dates! Consider the case where you want to know the actual address stated in an e-mail or a news posting. Then you could use the mini-parser
    char weird[] =
        "From: (Some \r\n"
        " comment) simons (stuff) \r\n"
        " @ computer (inserted) . (between) org(tokens)";

    string output;
    parse(weird,
          ( str_p("From:") >> *( anychar_p [append(output)] ) ),
          rfc_skipper_p);

    cout << "Stripped address is: '" << output << "'" << endl;
    assert(output == "simons@computer.org");
to get rid of the comments and white space -- the result being an address in its canonical representation. (Of course, RFC 822 address lines are much more complicated than this ... Consider this to be an example.)
The rfcdate_parser class (and helpers)