Built-in Character Classes

Most regular expression matchers provide a number of built-in character classes, most commonly the classes defined by the POSIX standard. Some provide more than one set: Java, for example, provides the POSIX classes, the Unicode classes, and a set of classes resembling the POSIX classes but differing from them in some respects.

The Unicode classes are based on the Unicode General Category property. A list of these categories and the abbreviations used for them is provided on the Help menu. The classification of every character can be found in the file UnicodeData.txt distributed by the Unicode Consortium.
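As a rough illustration of these categories, the following Python sketch looks up the two-letter category abbreviation for a handful of characters. It relies on the standard `unicodedata` module, which is built from the Unicode Character Database; the sample characters are an arbitrary choice made for this example.

```python
import unicodedata

# Arbitrary sample characters chosen for illustration.
samples = ["A", "a", "5", " ", "!", "_", "$", "\t", "é"]

# Map each character to its two-letter General Category abbreviation.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X}  {cat}")
```

Here `Lu` and `Ll` are upper- and lower-case letters, `Nd` is a decimal digit, `Zs` is a space separator, and `Cc` is a control character.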

The POSIX classes are defined very concretely for the ASCII character set. Outside of ASCII they are defined only by certain general principles. Implementors have some discretion in how they classify characters. Here is an ASCII chart color-coded to show the basic POSIX classes.


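00 nul  01 soh  02 stx  03 etx  04 eot  05 enq  06 ack  07 bel
08 bs   09 tab  0a nl   0b vt   0c ff   0d cr   0e so   0f si
10 dle  11 dc1  12 dc2  13 dc3  14 dc4  15 nak  16 syn  17 etb
18 can  19 em   1a sub  1b esc  1c fs   1d gs   1e rs   1f us
20 sp   21 !    22 "    23 #    24 $    25 %    26 &    27 '
28 (    29 )    2a *    2b +    2c ,    2d -    2e .    2f /
30 0    31 1    32 2    33 3    34 4    35 5    36 6    37 7
38 8    39 9    3a :    3b ;    3c <    3d =    3e >    3f ?
40 @    41 A    42 B    43 C    44 D    45 E    46 F    47 G
48 H    49 I    4a J    4b K    4c L    4d M    4e N    4f O
50 P    51 Q    52 R    53 S    54 T    55 U    56 V    57 W
58 X    59 Y    5a Z    5b [    5c \    5d ]    5e ^    5f _
60 `    61 a    62 b    63 c    64 d    65 e    66 f    67 g
68 h    69 i    6a j    6b k    6c l    6d m    6e n    6f o
70 p    71 q    72 r    73 s    74 t    75 u    76 v    77 w
78 x    79 y    7a z    7b {    7c |    7d }    7e ~    7f del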

The color coding of the basic POSIX classes is as follows:

Control Characters    [:cntrl:]
Space                 (no basic class)
Punctuation           [:punct:]
Digits                [:digit:]
Upper Case Letters    [:upper:]
Lower Case Letters    [:lower:]

Notice that the space character stands on its own and is not included in any basic class.
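The partition of ASCII into basic classes can be checked with a short Python sketch. It is not taken from any particular matcher: the sets are built from constants in Python's standard `string` module, on the assumption that `string.punctuation` coincides with the ASCII members of [:punct:], and the variable names are chosen only for this example.

```python
import string
from itertools import combinations

# The five basic classes, restricted to ASCII.
basic = {
    "cntrl": {chr(c) for c in range(0x00, 0x20)} | {chr(0x7F)},
    "punct": set(string.punctuation),
    "digit": set(string.digits),
    "upper": set(string.ascii_uppercase),
    "lower": set(string.ascii_lowercase),
}

# Every pair of basic classes should have no character in common.
disjoint = all(not (basic[a] & basic[b]) for a, b in combinations(basic, 2))

# Whatever ASCII character belongs to no basic class.
ascii_all = {chr(c) for c in range(128)}
unclassified = ascii_all - set().union(*basic.values())

print(disjoint)
print(sorted(unclassified))
```

If the assumption about `string.punctuation` holds, the only unclassified character is the space at 0x20.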

Most of the control characters should not appear in normal text. Those that are likely to appear are:

0x09    TAB    horizontal tab
0x0A    NL     newline/linefeed
0x0D    CR     carriage return

The usual derived classes are as follows.

Class         Definition
[:alpha:]     [:upper:] ∪ [:lower:]
[:alnum:]     [:alpha:] ∪ [:digit:]
[:xdigit:]    [:digit:] ∪ [AaBbCcDdEeFf]
[:graph:]     [:alnum:] ∪ [:punct:]
[:print:]     [:graph:] ∪ Space
[:blank:]     Space ∪ Tab
[:space:]     [:blank:] ∪ [NL VT FF CR]
[:word:]      [:alnum:] ∪ Underscore

All but [:word:] are defined in the POSIX standard. [:word:] is not a POSIX class (pace the bash manual) but reflects the fact that in quite a few programming languages the characters in this class are those permitted in identifiers.
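The derived classes can be written out as set unions over ASCII, as in the Python sketch below. The variable names are chosen for this example only. Python's `re` module does not accept the POSIX bracket names, so the final lines merely cross-check the [:word:] and [:space:] sets against Python's own `\w` and `\s` escapes in ASCII mode.

```python
import re
import string

# Basic classes over ASCII.
digit = set(string.digits)
upper = set(string.ascii_uppercase)
lower = set(string.ascii_lowercase)
punct = set(string.punctuation)

# Derived classes, following the definitions in the table.
alpha = upper | lower
alnum = alpha | digit
xdigit = digit | set("AaBbCcDdEeFf")
graph = alnum | punct
print_ = graph | {" "}                     # underscore avoids shadowing print()
blank = {" ", "\t"}
space = blank | {"\n", "\v", "\f", "\r"}   # NL VT FF CR
word = alnum | {"_"}

# Cross-check against Python's \w and \s, restricted to ASCII by re.ASCII.
ascii_chars = [chr(c) for c in range(128)]
re_word = {c for c in ascii_chars if re.fullmatch(r"\w", c, re.ASCII)}
re_space = {c for c in ascii_chars if re.fullmatch(r"\s", c, re.ASCII)}

print(re_word == word, re_space == space)
```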

The principle governing the classification of characters outside the ASCII range is that the structure of the system as applied to ASCII must be maintained, except that additional classes may be created. The rules for the derived classes must continue to hold, and the basic classes must remain disjoint.
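One way to picture this principle is a small checker that takes a proposed classification and reports any broken rule. The Python sketch below is only an illustration: the function names `derive` and `violations` are hypothetical, and so is the assignment of the Latin-1 characters É, é and ¿ to classes. Only the derived classes that are pure unions of other classes are checked.

```python
import string
from itertools import combinations

BASIC = ("cntrl", "punct", "digit", "upper", "lower")

# Derived-class rules that are pure unions of other classes.
RULES = {
    "alpha": ("upper", "lower"),
    "alnum": ("alpha", "digit"),
    "graph": ("alnum", "punct"),
}

def derive(basic):
    """Return a copy of `basic` extended with the classes named in RULES."""
    classes = {name: set(members) for name, members in basic.items()}
    for name, parts in RULES.items():  # order matters: alpha, alnum, graph
        classes[name] = set().union(*(classes[p] for p in parts))
    return classes

def violations(classes):
    """List the rules that `classes` breaks (empty list if none)."""
    problems = []
    for a, b in combinations(BASIC, 2):
        if classes[a] & classes[b]:
            problems.append(f"{a} and {b} overlap")
    for name, parts in RULES.items():
        if classes[name] != set().union(*(classes[p] for p in parts)):
            problems.append(f"{name} is not the union of its parts")
    return problems

ascii_basic = {
    "cntrl": {chr(c) for c in range(0x20)} | {chr(0x7F)},
    "punct": set(string.punctuation),
    "digit": set(string.digits),
    "upper": set(string.ascii_uppercase),
    "lower": set(string.ascii_lowercase),
}

# Hypothetical extension beyond ASCII that keeps the structure intact.
extended = derive({
    **ascii_basic,
    "upper": ascii_basic["upper"] | {"É"},
    "lower": ascii_basic["lower"] | {"é"},
    "punct": ascii_basic["punct"] | {"¿"},
})

# Deliberately broken: "É" is placed in both upper and lower.
broken = derive({
    **ascii_basic,
    "upper": ascii_basic["upper"] | {"É"},
    "lower": ascii_basic["lower"] | {"É"},
})

print(violations(extended))
print(violations(broken))
```

The first classification passes because the new characters were added to basic classes and the derived classes were recomputed from them. The second fails because the basic classes are no longer disjoint.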

