Most regular expression matchers provide a number of built-in character classes, most commonly the classes defined by the POSIX standard. Some provide more than one set: Java, for example, provides the POSIX classes, the Unicode classes, and a set of classes resembling the POSIX classes but differing from them in some respects.
The Unicode classes are based on the Unicode General Category property. A list of these categories and the abbreviations used for them is provided on the Help menu. The classification of every character can be found in the file UnicodeData.txt distributed by the Unicode Consortium.
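As an illustration of how these categories can be inspected, here is a minimal Python sketch. It relies on the standard `unicodedata` module, which reports the category recorded for a character in the Unicode Character Database. The sample characters are chosen here only for illustration.

```python
import unicodedata

# Each General Category abbreviation has two letters:
# a major class (L = letter, N = number, ...) and a subclass.
samples = ["A", "a", "5", " ", "_", "\t", "\u00e9"]  # \u00e9 is e with acute accent

categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X}  {cat}")
```

Which version of the Unicode data is consulted depends on the Python release in use.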
The POSIX classes are defined very concretely for the ASCII character set. Outside of ASCII they are defined only by certain general principles, which leaves implementors some discretion in how they classify characters. Here is an ASCII chart color-coded to show the basic POSIX classes.
00 | nul | 01 | soh | 02 | stx | 03 | etx | 04 | eot | 05 | enq | 06 | ack | 07 | bel |
08 | bs | 09 | tab | 0a | nl | 0b | vt | 0c | ff | 0d | cr | 0e | so | 0f | si |
10 | dle | 11 | dc1 | 12 | dc2 | 13 | dc3 | 14 | dc4 | 15 | nak | 16 | syn | 17 | etb |
18 | can | 19 | em | 1a | sub | 1b | esc | 1c | fs | 1d | gs | 1e | rs | 1f | us |
20 | sp | 21 | ! | 22 | " | 23 | # | 24 | $ | 25 | % | 26 | & | 27 | ' |
28 | ( | 29 | ) | 2a | * | 2b | + | 2c | , | 2d | - | 2e | . | 2f | / |
30 | 0 | 31 | 1 | 32 | 2 | 33 | 3 | 34 | 4 | 35 | 5 | 36 | 6 | 37 | 7 |
38 | 8 | 39 | 9 | 3a | : | 3b | ; | 3c | < | 3d | = | 3e | > | 3f | ? |
40 | @ | 41 | A | 42 | B | 43 | C | 44 | D | 45 | E | 46 | F | 47 | G |
48 | H | 49 | I | 4a | J | 4b | K | 4c | L | 4d | M | 4e | N | 4f | O |
50 | P | 51 | Q | 52 | R | 53 | S | 54 | T | 55 | U | 56 | V | 57 | W |
58 | X | 59 | Y | 5a | Z | 5b | [ | 5c | \ | 5d | ] | 5e | ^ | 5f | _ |
60 | ` | 61 | a | 62 | b | 63 | c | 64 | d | 65 | e | 66 | f | 67 | g |
68 | h | 69 | i | 6a | j | 6b | k | 6c | l | 6d | m | 6e | n | 6f | o |
70 | p | 71 | q | 72 | r | 73 | s | 74 | t | 75 | u | 76 | v | 77 | w |
78 | x | 79 | y | 7a | z | 7b | { | 7c | \| | 7d | } | 7e | ~ | 7f | del |
The color coding of the basic POSIX classes is as follows:
Control Characters | [:cntrl:] |
Space | |
Punctuation | [:punct:] |
Digits | [:digit:] |
Upper Case Letters | [:upper:] |
Lower Case Letters | [:lower:] |
Notice that the space character stands on its own and is not included in any basic class.
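The partition shown in the chart can be checked mechanically. The following Python sketch builds the basic classes as sets of ASCII characters and confirms that they cover all 128 code points without overlap. The dictionary keys, including `sp` for the lone space character, are names chosen for this example rather than part of any standard.

```python
import string

ASCII = {chr(c) for c in range(128)}

# Basic classes, restricted to ASCII. "sp" is the lone space character,
# which belongs to none of the named basic classes.
basic = {
    "cntrl": {chr(c) for c in range(0x20)} | {chr(0x7F)},
    "sp": {" "},
    "punct": set(string.punctuation),
    "digit": set(string.digits),
    "upper": set(string.ascii_uppercase),
    "lower": set(string.ascii_lowercase),
}

# If the sizes add up to 128 and the union is all of ASCII,
# no character is counted twice and none is left out.
total = sum(len(members) for members in basic.values())
union = set().union(*basic.values())
is_partition = union == ASCII and total == len(ASCII)

print({name: len(members) for name, members in basic.items()})
print("partition of ASCII:", is_partition)
```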
Most of the control characters should not appear in normal text. Those likely to appear are:
0x09 | TAB | horizontal tab |
0x0A | NL | newline/linefeed |
0x0D | CR | carriage return |
The usual derived classes are as follows.
Class | Definition |
---|---|
[:alpha:] | [:upper:] ∪ [:lower:] |
[:alnum:] | [:alpha:] ∪ [:digit:] |
[:xdigit:] | [:digit:] ∪ [AaBbCcDdEeFf] |
[:graph:] | [:alnum:] ∪ [:punct:] |
[:print:] | [:graph:] ∪ Space |
[:blank:] | Space ∪ Tab |
[:space:] | [:blank:] ∪ [NL VT FF CR] |
[:word:] | [:alnum:] ∪ Underscore |
All but [:word:] are defined in the POSIX standard. [:word:] is not a POSIX class (pace the bash manual) but reflects the fact that in quite a few programming languages the characters in this class are those permitted in identifiers.
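The definitions in the table translate directly into set unions. The sketch below, limited to ASCII, builds each derived class from the basic ones. Python's standard `re` module does not implement the bracketed POSIX names, so the sets are constructed by hand here, and the variable names are simply this example's own.

```python
import string

# Basic classes (ASCII only).
upper = set(string.ascii_uppercase)
lower = set(string.ascii_lowercase)
digit = set(string.digits)
punct = set(string.punctuation)

# Derived classes, following the table row by row.
alpha = upper | lower
alnum = alpha | digit
xdigit = digit | set("AaBbCcDdEeFf")
graph = alnum | punct
print_ = graph | {" "}            # trailing underscore avoids shadowing print()
blank = {" ", "\t"}
space = blank | set("\n\v\f\r")   # NL, VT, FF, CR
word = alnum | {"_"}              # not a POSIX class

for name, members in [("alpha", alpha), ("alnum", alnum), ("xdigit", xdigit),
                      ("graph", graph), ("print", print_), ("blank", blank),
                      ("space", space), ("word", word)]:
    print(f"[:{name}:] has {len(members)} ASCII members")
```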
The principle governing the classification of characters outside the ASCII range is that the structure of the system as applied to ASCII must be maintained, except that additional classes may be created. The rules for the derived classes must continue to hold, and the basic classes must remain disjoint.
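One way to read this principle is as a pair of checks on any extended classification: the ASCII assignments must be unchanged, and the basic classes must remain pairwise disjoint. The following Python sketch expresses that reading. The function names and the two sample extensions are hypothetical, made up for this illustration, and do not describe any particular implementation.

```python
from itertools import combinations
import string

ASCII = {chr(c) for c in range(128)}

def ascii_basic():
    """Return the basic classes restricted to ASCII ("sp" is the lone space)."""
    return {
        "cntrl": {chr(c) for c in range(0x20)} | {chr(0x7F)},
        "sp": {" "},
        "punct": set(string.punctuation),
        "digit": set(string.digits),
        "upper": set(string.ascii_uppercase),
        "lower": set(string.ascii_lowercase),
    }

def is_valid_extension(extended):
    """Check an extended classification against the two requirements."""
    base = ascii_basic()
    # 1. Restricted to ASCII, every basic class must be what it was before.
    same_on_ascii = all((extended[name] & ASCII) == base[name] for name in base)
    # 2. No character may belong to two basic classes.
    disjoint = all(extended[a].isdisjoint(extended[b])
                   for a, b in combinations(base, 2))
    return same_on_ascii and disjoint

# An acceptable extension: accented letters join the letter classes.
good = ascii_basic()
good["upper"] |= {"\u00c9"}   # E with acute accent
good["lower"] |= {"\u00e9"}   # e with acute accent

# An unacceptable one: the same character lands in two basic classes.
bad = ascii_basic()
bad["lower"] |= {"\u00e9"}
bad["punct"] |= {"\u00e9"}

print(is_valid_extension(good), is_valid_extension(bad))
```

If the derived classes are always computed from the basic ones by the usual unions, the rules for the derived classes hold by construction, so only the basic classes need checking.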