org.w3c.tidy
Class EncodingUtils

java.lang.Object
  extended by org.w3c.tidy.EncodingUtils

public final class EncodingUtils
extends java.lang.Object

Version:
$Revision: 1.7 $ ($Author: fgiust $)
Author:
Fabrizio Giustina

Nested Class Summary
(package private) static interface EncodingUtils.GetBytes
          Getter callback: called to retrieve 1 or more additional UTF-8 bytes.
(package private) static interface EncodingUtils.PutBytes
          Putter callback: called to store 1 or more additional UTF-8 bytes.
 
Field Summary
static int FSM_ASCII
          States for ISO 2022. A document in an ISO-2022-based encoding uses ESC sequences called "designators" to switch character sets.
static int FSM_ESC
          state ESC.
static int FSM_ESCD
          state ESCD.
static int FSM_ESCDP
          state ESCDP.
static int FSM_ESCP
          state ESCP.
static int FSM_NONASCII
          state NONASCII.
static int HIGH_UTF16_SURROGATE
          UTF-16 high surrogate.
static int LOW_UTF16_SURROGATE
          UTF-16 low surrogate.
private static int[] MAC2UNICODE
          John Love-Jensen contributed this table for mapping MacRoman character set to Unicode.
static int MAX_UTF16_FROM_UCS4
          Max UTF-16 value.
static int MAX_UTF8_FROM_UCS4
          Max UTF-8 valid char value.
private static int NUM_UTF8_SEQUENCES
          Number of valid UTF-8 sequences.
private static int[] OFFSET_UTF8_SEQUENCES
          Offset for utf8 sequences.
private static int[] SYMBOL2UNICODE
          Table to map symbol font characters to Unicode; undefined characters are mapped to 0x0000 and characters without any Unicode equivalent are mapped to '?'.
static int UNICODE_BOM
          the default (big-endian) UNICODE BOM.
static int UNICODE_BOM_BE
          the big-endian (default) UNICODE BOM.
static int UNICODE_BOM_LE
          the little-endian UNICODE BOM.
static int UNICODE_BOM_UTF8
          the UTF-8 UNICODE BOM.
static int UTF16_HIGH_SURROGATE_BEGIN
          UTF-16 surrogate pair areas: high surrogates begin.
static int UTF16_HIGH_SURROGATE_END
          UTF-16 surrogate pair areas: high surrogates end.
static int UTF16_LOW_SURROGATE_BEGIN
          UTF-16 surrogate pair areas: low surrogates begin.
static int UTF16_LOW_SURROGATE_END
          UTF-16 surrogate pair areas: low surrogates end.
static int UTF16_SURROGATES_BEGIN
          UTF-16 surrogates begin.
private static int UTF8_BYTE_SWAP_NOT_A_CHAR
          UTF-8 byte swap: invalid char.
private static int UTF8_NOT_A_CHAR
          UTF-8 invalid char.
private static ValidUTF8Sequence[] VALID_UTF8
          Array of valid UTF8 sequences.
private static int[] WIN2UNICODE
          Mapping for Windows Western character set (128-159) to Unicode.
 
Constructor Summary
private EncodingUtils()
          don't instantiate.
 
Method Summary
protected static int decodeMacRoman(int c)
          Function to convert from MacRoman to Unicode.
(package private) static int decodeSymbolFont(int c)
          Function to convert from Symbol Font chars to Unicode.
(package private) static boolean decodeUTF8BytesToChar(int[] c, int firstByte, byte[] successorBytes, EncodingUtils.GetBytes getter, int[] count, int startInSuccessorBytesArray)
          Decodes an array of bytes to a char.
protected static int decodeWin1252(int c)
          Function for conversion from Windows-1252 to Unicode.
(package private) static boolean encodeCharToUTF8Bytes(int c, byte[] encodebuf, EncodingUtils.PutBytes putter, int[] count)
          Encode a char to an array of bytes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNICODE_BOM_BE

public static final int UNICODE_BOM_BE
the big-endian (default) UNICODE BOM.

See Also:
Constant Field Values

UNICODE_BOM

public static final int UNICODE_BOM
the default (big-endian) UNICODE BOM.

See Also:
Constant Field Values

UNICODE_BOM_LE

public static final int UNICODE_BOM_LE
the little-endian UNICODE BOM.

See Also:
Constant Field Values

UNICODE_BOM_UTF8

public static final int UNICODE_BOM_UTF8
the UTF-8 UNICODE BOM.

See Also:
Constant Field Values
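
To illustrate how byte order marks like the ones above are usually recognised at the start of a stream, here is a minimal sketch. It is not JTidy's code: the class and method names (BomSketch, detect) are hypothetical, and the byte sequences checked (EF BB BF, FE FF, FF FE) are the ones defined by the Unicode standard. This page does not show the numeric values of the constants, so the sketch does not rely on them.

```java
final class BomSketch {
    private BomSketch() {
    }

    /**
     * Returns a label for the byte order mark found at the start of data,
     * or "none" if no known mark is present. Hypothetical helper.
     */
    static String detect(byte[] data) {
        // UTF-8 mark: EF BB BF
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2) {
            // Read the first two bytes as one big-endian 16-bit value.
            int first2 = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
            if (first2 == 0xFEFF) {
                return "UTF-16BE"; // bytes FE FF
            }
            if (first2 == 0xFFFE) {
                return "UTF-16LE"; // bytes FF FE
            }
        }
        return "none";
    }
}
```

The UTF-8 check comes first only because it needs three bytes; the three marks cannot be confused with each other.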

FSM_ASCII

public static final int FSM_ASCII
States for ISO 2022. A document in an ISO-2022-based encoding uses ESC sequences called "designators" to switch character sets. The designators defined and used in ISO-2022-JP are:
"ESC" + "(" + ? for ISO646 variants;
"ESC" + "$" + ? and "ESC" + "$" + "(" + ? for multibyte character sets.
State ASCII.

See Also:
Constant Field Values

FSM_ESC

public static final int FSM_ESC
state ESC.

See Also:
Constant Field Values

FSM_ESCD

public static final int FSM_ESCD
state ESCD.

See Also:
Constant Field Values

FSM_ESCDP

public static final int FSM_ESCDP
state ESCDP.

See Also:
Constant Field Values

FSM_ESCP

public static final int FSM_ESCP
state ESCP.

See Also:
Constant Field Values

FSM_NONASCII

public static final int FSM_NONASCII
state NONASCII.

See Also:
Constant Field Values
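
The six FSM_* states above can be read as a small state machine that follows the designator sequences "ESC ( ?", "ESC $ ?" and "ESC $ ( ?". The sketch below shows one plausible set of transitions under that reading. It is an assumption, not JTidy's implementation: the enum, class and method names are hypothetical, and every "ESC ( ?" designator is treated as a return to a single-byte set, which is a simplification.

```java
final class Iso2022Sketch {
    /** Hypothetical counterparts of the FSM_* constants. */
    enum State { ASCII, ESC, ESCD, ESCDP, ESCP, NONASCII }

    private Iso2022Sketch() {
    }

    /** Returns the state reached after reading byte c in the given state. */
    static State next(State state, int c) {
        if (c == 0x1B) {
            return State.ESC; // ESC always starts a new designator
        }
        switch (state) {
            case ESC:
                if (c == '$') {
                    return State.ESCD;  // "ESC $" seen
                }
                if (c == '(') {
                    return State.ESCP;  // "ESC (" seen
                }
                return State.ASCII;     // unknown sequence: fall back
            case ESCD:
                // "ESC $ (" needs one more byte; "ESC $ ?" is complete.
                return c == '(' ? State.ESCDP : State.NONASCII;
            case ESCDP:
                return State.NONASCII;  // "ESC $ ( ?" complete
            case ESCP:
                return State.ASCII;     // "ESC ( ?" complete
            default:
                return state;           // ASCII and NONASCII persist
        }
    }

    /** Feeds all bytes through the machine, starting in ASCII. */
    static State run(int... bytes) {
        State s = State.ASCII;
        for (int b : bytes) {
            s = next(s, b);
        }
        return s;
    }
}
```

For example, Iso2022Sketch.run(0x1B, '$', 'B') ends in NONASCII, and a following 0x1B, '(', 'B' brings the machine back to ASCII.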

MAX_UTF8_FROM_UCS4

public static final int MAX_UTF8_FROM_UCS4
Max UTF-8 valid char value.

See Also:
Constant Field Values

MAX_UTF16_FROM_UCS4

public static final int MAX_UTF16_FROM_UCS4
Max UTF-16 value.

See Also:
Constant Field Values

LOW_UTF16_SURROGATE

public static final int LOW_UTF16_SURROGATE
UTF-16 low surrogate.

See Also:
Constant Field Values

UTF16_SURROGATES_BEGIN

public static final int UTF16_SURROGATES_BEGIN
UTF-16 surrogates begin.

See Also:
Constant Field Values

UTF16_LOW_SURROGATE_BEGIN

public static final int UTF16_LOW_SURROGATE_BEGIN
UTF-16 surrogate pair areas: low surrogates begin.

See Also:
Constant Field Values

UTF16_LOW_SURROGATE_END

public static final int UTF16_LOW_SURROGATE_END
UTF-16 surrogate pair areas: low surrogates end.

See Also:
Constant Field Values

UTF16_HIGH_SURROGATE_BEGIN

public static final int UTF16_HIGH_SURROGATE_BEGIN
UTF-16 surrogate pair areas: high surrogates begin.

See Also:
Constant Field Values

UTF16_HIGH_SURROGATE_END

public static final int UTF16_HIGH_SURROGATE_END
UTF-16 surrogate pair areas: high surrogates end.

See Also:
Constant Field Values

HIGH_UTF16_SURROGATE

public static final int HIGH_UTF16_SURROGATE
UTF-16 high surrogate.

See Also:
Constant Field Values
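
As a worked illustration of what the surrogate ranges are for, the sketch below splits a supplementary code point into a UTF-16 surrogate pair and recombines it. The names (SurrogateSketch, split, combine) are hypothetical. The range boundaries are the ones fixed by the Unicode standard (high surrogates 0xD800 to 0xDBFF, low surrogates 0xDC00 to 0xDFFF); that the constants on this page hold exactly these values is an assumption.

```java
final class SurrogateSketch {
    // Range boundaries from the Unicode standard (assumed to match the constants above).
    static final int HIGH_BEGIN = 0xD800;
    static final int HIGH_END = 0xDBFF;
    static final int LOW_BEGIN = 0xDC00;
    static final int LOW_END = 0xDFFF;

    private SurrogateSketch() {
    }

    static boolean isHigh(int unit) {
        return unit >= HIGH_BEGIN && unit <= HIGH_END;
    }

    static boolean isLow(int unit) {
        return unit >= LOW_BEGIN && unit <= LOW_END;
    }

    /** Splits a code point in 0x10000..0x10FFFF into {high, low}. */
    static int[] split(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;           // 20 significant bits
        return new int[] {
            HIGH_BEGIN + (v >> 10),            // top 10 bits
            LOW_BEGIN + (v & 0x3FF)            // bottom 10 bits
        };
    }

    /** Recombines a valid {high, low} pair into a code point. */
    static int combine(int high, int low) {
        if (!isHigh(high) || !isLow(low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return 0x10000 + ((high - HIGH_BEGIN) << 10) + (low - LOW_BEGIN);
    }
}
```

For U+1F600 the split yields 0xD83D and 0xDE00, the same pair that java.lang.Character.toChars produces.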

UTF8_BYTE_SWAP_NOT_A_CHAR

private static final int UTF8_BYTE_SWAP_NOT_A_CHAR
UTF-8 byte swap: invalid char.

See Also:
Constant Field Values

UTF8_NOT_A_CHAR

private static final int UTF8_NOT_A_CHAR
UTF-8 invalid char.

See Also:
Constant Field Values

WIN2UNICODE

private static final int[] WIN2UNICODE
Mapping for Windows Western character set (128-159) to Unicode.


MAC2UNICODE

private static final int[] MAC2UNICODE
John Love-Jensen contributed this table for mapping MacRoman character set to Unicode.


SYMBOL2UNICODE

private static final int[] SYMBOL2UNICODE
Table to map symbol font characters to Unicode; undefined characters are mapped to 0x0000 and characters without any Unicode equivalent are mapped to '?'. Is this appropriate?


VALID_UTF8

private static final ValidUTF8Sequence[] VALID_UTF8
Array of valid UTF8 sequences.


NUM_UTF8_SEQUENCES

private static final int NUM_UTF8_SEQUENCES
Number of valid UTF-8 sequences.


OFFSET_UTF8_SEQUENCES

private static final int[] OFFSET_UTF8_SEQUENCES
Offset for utf8 sequences.

Constructor Detail

EncodingUtils

private EncodingUtils()
don't instantiate.

Method Detail

decodeWin1252

protected static int decodeWin1252(int c)
Function for conversion from Windows-1252 to Unicode.

Parameters:
c - char to decode
Returns:
decoded char
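
A table-driven conversion of this kind can be sketched as follows. This is an illustration, not JTidy's source: the class name and table are hypothetical, the 32 entries are the standard Windows-1252 assignments for codes 128-159, and mapping the five unassigned codes to 0x0000 is an assumption made here. Codes outside 128-159 are returned unchanged because Windows-1252 agrees with ISO-8859-1 (and so with Unicode) there.

```java
final class Win1252Sketch {
    // Windows-1252 codes 128..159; 0x0000 marks an unassigned code (assumption).
    private static final int[] TABLE = {
        0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
        0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
    };

    private Win1252Sketch() {
    }

    /** Maps a Windows-1252 code to a Unicode code point. */
    static int decode(int c) {
        if (c >= 128 && c <= 159) {
            return TABLE[c - 128];
        }
        return c; // identical to Unicode outside the 128..159 window
    }
}
```

With this table, Win1252Sketch.decode(0x80) gives 0x20AC (the euro sign) and Win1252Sketch.decode(0x41) gives 0x41 unchanged.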

decodeMacRoman

protected static int decodeMacRoman(int c)
Function to convert from MacRoman to Unicode.

Parameters:
c - char to decode
Returns:
decoded char

decodeSymbolFont

static int decodeSymbolFont(int c)
Function to convert from Symbol Font chars to Unicode.

Parameters:
c - char to decode
Returns:
decoded char

decodeUTF8BytesToChar

static boolean decodeUTF8BytesToChar(int[] c,
                                     int firstByte,
                                     byte[] successorBytes,
                                     EncodingUtils.GetBytes getter,
                                     int[] count,
                                     int startInSuccessorBytesArray)
Decodes an array of bytes to a char.

Parameters:
c - will contain the decoded char
firstByte - first input byte
successorBytes - array containing successor bytes (can be null if a getter is provided).
getter - callback used to get new bytes if successorBytes doesn't contain enough bytes
count - will contain the number of bytes read
startInSuccessorBytesArray - starting offset for bytes in successorBytes
Returns:
true if error
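
The parameter list above suggests a decoder that takes the lead byte, reads continuation bytes from an array when available, and otherwise pulls them from a callback. The following is a minimal sketch of that pattern, not the JTidy implementation. The signature of EncodingUtils.GetBytes is not shown on this page, so the sketch substitutes java.util.function.IntSupplier and assumes it returns -1 at end of input; the class name and the exact meaning of count on error (bytes examined) are also assumptions.

```java
import java.util.function.IntSupplier;

final class Utf8DecodeSketch {
    private Utf8DecodeSketch() {
    }

    /**
     * Decodes one code point into c[0] and stores the number of bytes
     * examined in count[0]. Returns true on error, false on success.
     */
    static boolean decode(int[] c, int firstByte, byte[] successors,
                          IntSupplier getter, int[] count, int start) {
        int b0 = firstByte & 0xFF;
        int extra;  // number of continuation bytes expected
        int value;  // payload bits collected so far
        int min;    // smallest value allowed for this length (rejects overlong forms)
        if (b0 < 0x80) {
            extra = 0; value = b0; min = 0;
        } else if ((b0 & 0xE0) == 0xC0) {
            extra = 1; value = b0 & 0x1F; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            extra = 2; value = b0 & 0x0F; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            extra = 3; value = b0 & 0x07; min = 0x10000;
        } else {
            count[0] = 1;
            return true; // stray continuation byte or invalid lead byte
        }
        for (int i = 0; i < extra; i++) {
            int b;
            if (successors != null && start + i < successors.length) {
                b = successors[start + i] & 0xFF;
            } else if (getter != null) {
                b = getter.getAsInt(); // assumed to return -1 at end of input
            } else {
                count[0] = 1 + i;
                return true; // ran out of input
            }
            if (b < 0 || (b & 0xC0) != 0x80) {
                count[0] = 2 + i;
                return true; // not a continuation byte
            }
            value = (value << 6) | (b & 0x3F);
        }
        count[0] = 1 + extra;
        boolean bad = value < min || value > 0x10FFFF
                || (value >= 0xD800 && value <= 0xDFFF);
        if (!bad) {
            c[0] = value;
        }
        return bad;
    }
}
```

As in the documented method, the decoded value and byte count come back through one-element arrays, and the boolean result signals an error rather than success.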

encodeCharToUTF8Bytes

static boolean encodeCharToUTF8Bytes(int c,
                                     byte[] encodebuf,
                                     EncodingUtils.PutBytes putter,
                                     int[] count)
Encode a char to an array of bytes.

Parameters:
c - char to encode
encodebuf - will contain the encoded bytes
putter - if not null it will be called to write bytes to out
count - number of bytes written
Returns:
false = ok, true = error
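
The encoding direction can be sketched in the same style. This is an illustration under stated assumptions, not JTidy's code: the class name is hypothetical, the putter callback is left out because the signature of EncodingUtils.PutBytes is not shown on this page, and rejecting surrogates and values above 0x10FFFF is a choice made here. The return convention follows the one documented above (false = ok, true = error).

```java
final class Utf8EncodeSketch {
    private Utf8EncodeSketch() {
    }

    /**
     * Writes the UTF-8 form of code point c into buf (at least 4 bytes long)
     * and the number of bytes written into count[0].
     * Returns false on success, true on error.
     */
    static boolean encode(int c, byte[] buf, int[] count) {
        if (c < 0 || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
            count[0] = 0;
            return true; // not an encodable Unicode scalar value
        }
        int n;
        if (c < 0x80) {
            buf[0] = (byte) c;                         // 0xxxxxxx
            n = 1;
        } else if (c < 0x800) {
            buf[0] = (byte) (0xC0 | (c >> 6));         // 110xxxxx
            buf[1] = (byte) (0x80 | (c & 0x3F));       // 10xxxxxx
            n = 2;
        } else if (c < 0x10000) {
            buf[0] = (byte) (0xE0 | (c >> 12));        // 1110xxxx
            buf[1] = (byte) (0x80 | ((c >> 6) & 0x3F));
            buf[2] = (byte) (0x80 | (c & 0x3F));
            n = 3;
        } else {
            buf[0] = (byte) (0xF0 | (c >> 18));        // 11110xxx
            buf[1] = (byte) (0x80 | ((c >> 12) & 0x3F));
            buf[2] = (byte) (0x80 | ((c >> 6) & 0x3F));
            buf[3] = (byte) (0x80 | (c & 0x3F));
            n = 4;
        }
        count[0] = n;
        return false;
    }
}
```

Encoding 0x20AC this way writes the three bytes E2 82 AC and sets count[0] to 3.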