Unit regex


This library unit provides support for regular expressions. The regular expression package used is irregex, written by Alex Shinn. Irregex supports most Perl extensions and is written completely in Scheme.
This library unit exposes two APIs: the standard Chicken API described below, and the original irregex API. You may use either API or both:
 (require-library regex)   ; required for either API, or both
 (import regex)            ; import the Chicken regex API
 (import irregex)          ; import the original irregex API

Regular expressions may be either POSIX-style strings (with most PCRE extensions) or an SCSH-style SRE. There is no (rx ...) syntax - just use normal Scheme lists, with quasiquote if you like.
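As a rough sketch of the two notations, the following matches the same text once with a POSIX-style string and once with a quasiquoted SRE. The names sep, by-string and by-sre are made up for illustration.

```scheme
(require-library regex)
(import regex)

;; Hypothetical separator, spliced into the SRE with unquote.
(define sep ", ")

;; POSIX-style string with two parenthesized groups.
(define by-string
  (string-search "([a-z]+), ([a-z]+)" "hello, world"))

;; The same pattern as an ordinary quasiquoted Scheme list.
(define by-sre
  (string-search `(: (submatch (+ lower)) ,sep (submatch (+ lower)))
                 "hello, world"))

;; Both should be ("hello, world" "hello" "world").
```

Quasiquote is only needed when part of the pattern is computed at run time; a constant SRE can be written with a plain quote.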

grep


 [procedure] (grep REGEX LIST)

Returns all items of LIST that match the regular expression REGEX. This procedure could be defined as follows:
(define (grep regex lst)
  (filter (lambda (x) (string-search regex x)) lst) )
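
A short usage sketch, assuming a hypothetical list named fruits. Because grep behaves like a filter over string-search, an item is kept when the pattern matches anywhere inside it.

```scheme
(require-library regex)
(import regex)

;; Hypothetical input list.
(define fruits '("apple" "banana" "avocado" "cherry"))

;; Items that start with "a".
(define a-words (grep "^a" fruits))   ; ("apple" "avocado")

;; Items that contain "an" anywhere.
(define with-an (grep "an" fruits))   ; ("banana")
```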


glob->regexp


 [procedure] (glob->regexp PATTERN)

Converts the file-pattern PATTERN into a regular expression.
(glob->regexp "foo.*")
=> "foo\..*"

PATTERN should follow "glob" syntax. Allowed wildcards are
 *
 [C...]
 [C1-C2]
 [-C...]
 ?


glob?


 [procedure] (glob? STRING)

Does the STRING have any "glob" wildcards?
A string without any "glob" wildcards does not meet the criteria, even though it technically is a valid "glob" file-pattern.
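
The two glob procedures can be combined roughly as follows. This is a minimal sketch; the variable names and file names are invented for the example.

```scheme
(require-library regex)
(import regex)

(define pattern "*.scm")

(define has-wildcards (glob? pattern))      ; #t
(define plain-name    (glob? "main.scm"))   ; #f, no wildcard present

;; Convert the glob once, then test file names against it.
(define rx (glob->regexp pattern))
(define hit  (and (string-match rx "main.scm") #t))   ; #t
(define miss (and (string-match rx "main.txt") #t))   ; #f
```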

regexp


 [procedure] (regexp STRING [IGNORECASE [IGNORESPACE [UTF8]]])

Returns a precompiled regular expression object for STRING. The optional arguments IGNORECASE, IGNORESPACE and UTF8 specify whether the regular expression should be matched with case- or whitespace-differences ignored, or whether the string should be treated as containing UTF-8 encoded characters, respectively.
Note that code that uses regular expressions heavily should always use them in precompiled form, which is likely to be much faster than passing strings to any of the regular-expression routines described below.
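
The advice above might look like this in practice: compile the pattern once and reuse the object for every call. The helper first-number and the sample strings are assumptions made for the sketch.

```scheme
(require-library regex)
(import regex)

;; Compile once, outside any loop.
(define digits   (regexp "[0-9]+"))
(define hello-ci (regexp "hello" #t))   ; IGNORECASE is true

;; Hypothetical helper: first run of digits in STR, or #f.
(define (first-number str)
  (let ((m (string-search digits str)))
    (and m (car m))))

(define numbers (map first-number '("item 12" "no digits" "x7y")))
;; ("12" #f "7")

(define ci-hit (string-search hello-ci "Say HELLO"))
;; ("HELLO")
```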

regexp?


 [procedure] (regexp? X)

Returns #t if X is a precompiled regular expression, or #f otherwise.

string-match

string-match-positions


 [procedure] (string-match REGEXP STRING [START])
 [procedure] (string-match-positions REGEXP STRING [START])

Matches the regular expression in REGEXP (a string or a precompiled regular expression) with STRING and returns either #f if the match failed, or a list of matching groups, where the first element is the complete match. If the optional argument START is supplied, it specifies the starting position in STRING.
For each matching group the result-list contains one of the following:
 #f for a non-matching but optional group
 a list of start- and end-position of the match in STRING (in the case of string-match-positions)
 the matching substring (in the case of string-match)

Note that the exact string is matched. For searching a pattern inside a string, see below.
Note also that string-match is implemented by calling string-search with the regular expression wrapped in ^ ... $. If invoked with a precompiled regular expression argument (by using regexp), string-match is identical to string-search.
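
The shapes of the return values described above can be illustrated with a small sketch; the pattern and input strings are invented for the example.

```scheme
(require-library regex)
(import regex)

(define rx "([a-z]+)-([0-9]+)")

;; Whole match first, then one entry per group.
(define groups (string-match rx "abc-123"))
;; ("abc-123" "abc" "123")

;; Same match, reported as (START END) pairs.
(define positions (string-match-positions rx "abc-123"))
;; ((0 7) (0 3) (4 7))

;; The whole string must match, so extra leading text fails.
(define partial (string-match rx "see abc-123"))
;; #f

;; An optional group that did not take part yields #f.
(define optional (string-match "(a)?(b)" "b"))
;; ("b" #f "b")
```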

string-search

string-search-positions


 [procedure] (string-search REGEXP STRING [START [RANGE]])
 [procedure] (string-search-positions REGEXP STRING [START [RANGE]])

Searches for the first match of the regular expression in REGEXP with STRING. The search can be limited to RANGE characters.
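
A brief sketch of searching inside a longer string, with and without START. The sample text is an assumption made for the example.

```scheme
(require-library regex)
(import regex)

(define text "abc 123 def 456")

;; First run of digits anywhere in the string.
(define first-hit (string-search "[0-9]+" text))            ; ("123")
(define first-pos (string-search-positions "[0-9]+" text))  ; ((4 7))

;; Start looking at index 7, just past the first number.
(define second-hit (string-search "[0-9]+" text 7))         ; ("456")
```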

string-split-fields


 [procedure] (string-split-fields REGEXP STRING [MODE [START]])

Splits STRING into a list of fields according to MODE, where MODE can be the keyword #:infix (REGEXP matches field separator), the keyword #:suffix (REGEXP matches field terminator) or #t (REGEXP matches field), which is the default.
(define s "this is a string 1, 2, 3,")

(string-split-fields "[^ ]+" s)

  => ("this" "is" "a" "string" "1," "2," "3,")

(string-split-fields " " s #:infix)

  => ("this" "is" "a" "string" "1," "2," "3,")

(string-split-fields "," s #:suffix)
 
  => ("this is a string 1" " 2" " 3")


string-substitute


 [procedure] (string-substitute REGEXP SUBST STRING [MODE])

Searches substrings in STRING that match REGEXP and substitutes them with the string SUBST. The substitution can contain references to subexpressions in REGEXP with the \NUM notation, where NUM refers to the NUMth parenthesized expression. The optional argument MODE defaults to 1 and specifies the number of the match to be substituted. Any non-numeric index specifies that all matches are to be substituted.
(string-substitute "([0-9]+) (eggs|chicks)" "\\2 (\\1)" "99 eggs or 99 chicks" 2)
=> "99 eggs or chicks (99)"

Note that a regular expression that matches an empty string will signal an error.
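
The effect of MODE can be sketched as follows, using an invented sample string: the default replaces the first match, a number selects that match, and any non-numeric value (here #t) replaces all matches.

```scheme
(require-library regex)
(import regex)

(define text "1 and 22 and 333")

(define first-only (string-substitute "[0-9]+" "N" text))     ; "N and 22 and 333"
(define second     (string-substitute "[0-9]+" "N" text 2))   ; "1 and N and 333"
(define all        (string-substitute "[0-9]+" "N" text #t))  ; "N and N and N"

;; \NUM references swap the two groups.
(define swapped
  (string-substitute "([a-z]+)=([0-9]+)" "\\2=\\1" "x=1"))    ; "1=x"
```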

string-substitute*


 [procedure] (string-substitute* STRING SMAP [MODE])

Substitutes elements of STRING with string-substitute according to SMAP. SMAP should be an association-list where each element of the list is a pair of the form (MATCH . REPLACEMENT). Every occurrence of the regular expression MATCH in STRING will be replaced by the string REPLACEMENT.
(string-substitute* "<h1>Hello, world!</h1>" '(("<[/A-Za-z0-9]+>" . "")))

=>  "Hello, world!"
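
With more than one pair in SMAP, each substitution is applied to the string in turn. A minimal sketch with invented data:

```scheme
(require-library regex)
(import regex)

;; Two (MATCH . REPLACEMENT) pairs.
(define result
  (string-substitute* "the black cat"
                      '(("black" . "white")
                        ("cat"   . "dog"))))
;; "the white dog"
```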


regexp-escape


 [procedure] (regexp-escape STRING)

Escapes all special characters in STRING with \, so that the string can be embedded into a regular expression.
(regexp-escape "^[0-9]+:.*$")
=>  "\\^\\[0-9\\]\\+:\\.\\*\\$"
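
One typical use, sketched here with an invented literal, is searching for text that happens to contain regex metacharacters:

```scheme
(require-library regex)
(import regex)

(define literal "1+1")

;; Unescaped, "1+1" means "one or more 1s, then a 1",
;; which does not occur in the input.
(define unescaped (string-search literal "1+1=2"))
;; #f

;; Escaped, the + is taken literally.
(define escaped (string-search (regexp-escape literal) "1+1=2"))
;; ("1+1")
```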

Extended SRE Syntax


The following table summarizes the SRE syntax, with detailed explanations following.
  ;; basic patterns
  <string>                          ; literal string
  (seq <sre> ...)                   ; sequence
  (: <sre> ...)
  (or <sre> ...)                    ; alternation
  
  ;; optional/multiple patterns
  (? <sre> ...)                     ; 0 or 1 matches
  (* <sre> ...)                     ; 0 or more matches
  (+ <sre> ...)                     ; 1 or more matches
  (= <n> <sre> ...)                 ; exactly <n> matches
  (>= <n> <sre> ...)                ; <n> or more matches
  (** <from> <to> <sre> ...)        ; <from> to <to> matches
  (?? <sre> ...)                    ; non-greedy pattern: 0 or 1 matches
  (*? <sre> ...)                    ; non-greedy kleene star
  (**? <from> <to> <sre> ...)       ; non-greedy range
  
  ;; submatch patterns
  (submatch <sre> ...)              ; numbered submatch
  (submatch-named <name> <sre> ...) ; named submatch
  (=> <name> <sre> ...)
  (backref <n-or-name>)             ; match a previous submatch
  
  ;; toggling case-sensitivity
  (w/case <sre> ...)                ; enclosed <sre>s are case-sensitive
  (w/nocase <sre> ...)              ; enclosed <sre>s are case-insensitive
  
  ;; character sets
  <char>                            ; singleton char set
  (<string>)                        ; set of chars
  (or <cset-sre> ...)               ; set union
  (~ <cset-sre> ...)                ; set complement (i.e. [^...])
  (- <cset-sre> ...)                ; set difference
  (& <cset-sre> ...)                ; set intersection
  (/ <range-spec> ...)              ; pairs of chars as ranges
  
  ;; named character sets
  any
  nonl
  ascii
  lower-case     lower
  upper-case     upper
  alphabetic     alpha
  numeric        num
  alphanumeric   alphanum  alnum
  punctuation    punct
  graphic        graph
  whitespace     white     space
  printing       print
  control        cntrl
  hex-digit      xdigit
  
  ;; assertions and conditionals
  bos eos                           ; beginning/end of string
  bol eol                           ; beginning/end of line
  bow eow                           ; beginning/end of word
  nwb                               ; non-word-boundary
  (look-ahead <sre> ...)            ; zero-width look-ahead assertion
  (look-behind <sre> ...)           ; zero-width look-behind assertion
  (neg-look-ahead <sre> ...)        ; zero-width negative look-ahead assertion
  (neg-look-behind <sre> ...)       ; zero-width negative look-behind assertion
  (atomic <sre> ...)                ; for (?>...) independent patterns
  (if <test> <pass> [<fail>])       ; conditional patterns
  commit                            ; don't backtrack beyond this (i.e. cut)
  
  ;; backwards compatibility
  (posix-string <string>)           ; embed a POSIX string literal

Basic SRE Patterns


The simplest SRE is a literal string, which matches that string exactly.
  (string-search "needle" "hayneedlehay") => <match>

By default the match is case-sensitive, though you can control this either with the compiler flags or local overrides:
  (string-search "needle" "haynEEdlehay") => #f
  
  (string-search (irregex "needle" 'i) "haynEEdlehay") => <match>
  
  (string-search '(w/nocase "needle") "haynEEdlehay") => <match>

You can use w/case to switch back to case-sensitivity inside a w/nocase:
  (string-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => <match>
  
  (string-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f

Of course, literal strings by themselves aren't very interesting regular expressions, so we want to be able to compose them. The most basic way to do this is with the seq operator (or its abbreviation :), which matches one or more patterns consecutively:
  (string-search '(: "one" space "two" space "three") "one two three") => <match>

As you may have noticed above, the w/case and w/nocase operators allowed multiple SREs in a sequence - other operators that take any number of arguments (e.g. the repetition operators below) allow such implicit sequences.
To match any one of a set of patterns use the or alternation operator:
  (string-search '(or "eeney" "meeney" "miney") "meeney") => <match>

  (string-search '(or "eeney" "meeney" "miney") "moe") => #f

SRE Repetition Patterns


There are also several ways to control the number of times a pattern is matched. The simplest of these is ? which just optionally matches the pattern:
  (string-search '(: "match" (? "es") "!") "matches!") => <match>
  
  (string-search '(: "match" (? "es") "!") "match!") => <match>
  
  (string-search '(: "match" (? "es") "!") "matche!") => #f

To optionally match any number of times, use *, the Kleene star:
  (string-search '(: "<" (* (~ #\>)) ">") "<html>") => <match>
  
  (string-search '(: "<" (* (~ #\>)) ">") "<>") => <match>
  
  (string-search '(: "<" (* (~ #\>)) ">") "<html") => #f

Often you want to match any number of times, but at least one time is required, and for that you use +:
  (string-search '(: "<" (+ (~ #\>)) ">") "<html>") => <match>
  
  (string-search '(: "<" (+ (~ #\>)) ">") "<a>") => <match>
  
  (string-search '(: "<" (+ (~ #\>)) ">") "<>") => #f

More generally, to match at least a given number of times, use >=:
  (string-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => <match>

  (string-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => <match>
  
  (string-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f

To match a specific number of times exactly, use =:
  (string-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => <match>
  
  (string-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f

And finally, the most general form is ** which specifies a range of times to match. All of the earlier forms are special cases of this.
  (string-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => <match>

  (string-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f

There are also so-called "non-greedy" variants of these repetition operators, by convention suffixed with an additional ?. Since the normal repetition patterns can match any of the allotted repetition range, these operators will match a string if and only if the normal versions matched. However, the position and extent of the match and of its submatches can differ. Whenever those endpoints matter (which is always the case with string-search, since the endpoints of the match itself are reported), using a non-greedy repetition can change the result.
So, whereas ? can be thought to mean "match or don't match," ?? means "don't match or match." * typically consumes as much as possible, but *? tries first to match zero times, and only consumes one at a time if that fails. If you have a greedy operator followed by a non-greedy operator in the same pattern, they can produce surprising results as they compete to make the match longer or shorter. If this seems confusing, that's because it is. Non-greedy repetitions are defined only in terms of the specific backtracking algorithm used to implement them, which for compatibility purposes always means the Perl algorithm. Thus, when using these patterns you force IrRegex to use a backtracking engine, and can't rely on efficient execution.
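
A small sketch of the difference, using an invented input string: both patterns match, but the greedy form extends to the last ">" while the non-greedy form stops at the first.

```scheme
(require-library regex)
(import regex)

(define text "<a><b>")

;; Greedy star: consumes as much as possible.
(define greedy (string-search '(: "<" (* any) ">") text))
;; ("<a><b>")

;; Non-greedy star: consumes as little as possible.
(define lazy (string-search '(: "<" (*? any) ">") text))
;; ("<a>")
```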

SRE Character Sets


Perhaps more common than matching specific strings is matching any of a set of characters. You can use the or alternation pattern on a list of single-character strings to simulate a character set, but this is too clumsy for everyday use so SRE syntax allows a number of shortcuts.
A single character matches that character literally, a trivial character class. More conveniently, a list holding a single element which is a string refers to the character set composed of every character in the string.
  (string-match '(* #\-) "---") => <match>
  
  (string-match '(* #\-) "-_-") => #f
  
  (string-match '(* ("aeiou")) "oui") => <match>
  
  (string-match '(* ("aeiou")) "ouais") => #f

Ranges are introduced with the / operator. Any strings or characters in the / are flattened and then taken in pairs to represent the start and end points, inclusive, of character ranges.
  (string-match '(* (/ "AZ09")) "R2D2") => <match>
  
  (string-match '(* (/ "AZ09")) "C-3PO") => #f

In addition, a number of set algebra operations are provided. or, of course, has the same meaning, but when all the options are character sets it can be thought of as the set union operator. This is further extended by the & set intersection, - set difference, and ~ set complement operators.
  (string-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => <match>
  
  (string-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f

  (string-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => <match>
  
  (string-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f

SRE Assertion Patterns


There are a number of times it can be useful to assert something about the area around a pattern without explicitly making it part of the pattern. The most common cases are specifically anchoring some pattern to the beginning or end of a word or line or even the whole string. For example, to match on the end of a word:
  (string-search '(: "foo" eow) "foo") => <match>
  
  (string-search '(: "foo" eow) "foo!") => <match>
  
  (string-search '(: "foo" eow) "foof") => #f

The bow, bol, eol, bos and eos assertions work similarly. nwb asserts that you are not at a word boundary; if substituted for eow in the above examples it would reverse all the results.
There is no wb, since you tend to know from context whether it would be the beginning or end of a word, but if you need it you can always use (or bow eow).
Somewhat more generally, Perl introduced positive and negative look-ahead and look-behind patterns. Perl look-behind patterns are limited to a fixed length, whereas the IrRegex versions have no such limit.
  (string-search '(: "regular" (look-ahead " expression")) "regular expression") => <match>
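
The negative form can be sketched the same way. In this invented example the assertion inspects the text after "foo" without making it part of the match.

```scheme
(require-library regex)
(import regex)

;; "foo" only when "bar" follows; "bar" itself is not consumed.
(define ahead
  (string-search '(: "foo" (look-ahead "bar")) "foobar"))
;; ("foo")

;; "foo" only when "bar" does NOT follow.
(define blocked
  (string-search '(: "foo" (neg-look-ahead "bar")) "foobar"))
;; #f

(define allowed
  (string-search '(: "foo" (neg-look-ahead "bar")) "foobaz"))
;; ("foo")
```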

The most general case, of course, would be an and pattern to complement the or pattern - all the patterns must match or the whole pattern fails. This may be provided in a future release, although it (like the look-ahead and look-behind assertions) is unlikely to be compiled efficiently.

