This text describes the functions rtl_convertTextToUnicode()
and rtl_convertUnicodeToText(), the meaning of all the
accompanying RTL_TEXTTOUNICODE_FLAGS_XXX,
RTL_TEXTTOUNICODE_INFO_XXX,
RTL_UNICODETOTEXT_FLAGS_XXX and
RTL_UNICODETOTEXT_INFO_XXX flags, and the conversion
context conventions.
It is valid to pass a null pointer instead of an
rtl_TextToUnicodeContext or rtl_UnicodeToTextContext
to the conversion functions. In that case, the functions behave as if they
received an initial context, as obtained by
rtl_createTextToUnicodeContext(),
rtl_resetTextToUnicodeContext(),
rtl_createUnicodeToTextContext(), or
rtl_resetUnicodeToTextContext(), and simply do not return any
context information (which is effectively lost). This implies that you should
always specify the FLAGS_FLUSH flag when using a null context,
for otherwise it is not possible in general to find out whether the input
buffer has been completely converted.
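The effect of a null context can be illustrated with a small self-contained sketch. It does not call the real rtl functions: ToyContext, toy_convert(), and the TOY_ constants are hypothetical stand-ins, the two-byte encoding is a simplification loosely modelled on EUC-CN, and the output holds raw code values rather than real Unicode mappings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real flag and info bits. */
enum { TOY_FLAG_FLUSH = 1u };
enum { TOY_INFO_INVALID = 1u, TOY_INFO_SRCBUFFERTOSMALL = 2u };

/* Toy conversion context: remembers a lead byte seen at the end of a chunk. */
typedef struct { unsigned char pending_lead; } ToyContext;

/* Decode a toy two-byte encoding: ASCII bytes, or a lead byte and a trail
   byte that are both in 0xA1..0xFE. ctx may be NULL, in which case the
   function starts from the initial state and cannot hand a trailing lead
   byte over to the next call. dst must have room for n values.
   Returns the number of values written. */
static size_t toy_convert(ToyContext *ctx, const unsigned char *src, size_t n,
                          unsigned *dst, unsigned flags, unsigned *info)
{
    size_t out = 0;
    unsigned lead = ctx ? ctx->pending_lead : 0;
    assert(src != NULL && dst != NULL && info != NULL);
    *info = 0;
    for (size_t i = 0; i < n; ++i) {
        unsigned char b = src[i];
        if (lead) {                              /* expecting a trail byte */
            if (b >= 0xA1 && b <= 0xFE)
                dst[out++] = (lead << 8) | b;
            else
                *info |= TOY_INFO_INVALID;       /* toy policy: drop both bytes */
            lead = 0;
        } else if (b >= 0xA1 && b <= 0xFE) {
            lead = b;                            /* start of a two-byte code */
        } else if (b < 0x80) {
            dst[out++] = b;                      /* plain ASCII */
        } else {
            *info |= TOY_INFO_INVALID;
        }
    }
    if (lead) {                                  /* input ended after a lead byte */
        if (flags & TOY_FLAG_FLUSH) {
            *info |= TOY_INFO_INVALID;
            lead = 0;
        } else {
            *info |= TOY_INFO_SRCBUFFERTOSMALL;
        }
    }
    if (ctx)
        ctx->pending_lead = (unsigned char)lead; /* lost when ctx is NULL */
    return out;
}
```

With a context, a lead byte at the end of one chunk is completed by the next call. With a null context and the flush flag, the same lead byte is reported as invalid right away, so the caller at least learns that the input was not fully converted.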
An undefined code is a code that is well-formed according to the input
encoding but for which no mapping to the output encoding exists. Examples of
codes that are not assigned a character are 0xA5 in ISO 8859-3,
0xA2A1 in EUC-CN, and 0x167F in Unicode.
In the text-to-Unicode direction, the conversion functions distinguish
between single-byte and multi-byte undefined codes (0xA5 in
ISO 8859-3 and 0x80 in GB-18030 are single-byte undefined codes,
while 0xA2A1 in EUC-CN and 0xFE39FE39 in GB-18030
are multi-byte undefined codes).
When encountering an undefined code, the conversion functions allow any of the
following behaviours (which are mutually exclusive):
FLAGS_UNDEFINED_ERROR, FLAGS_MBUNDEFINED_ERROR
    Set the INFO_UNDEFINED or INFO_MBUNDEFINED and the INFO_ERROR flags, and
    immediately quit the conversion (ignoring any FLAGS_FLUSH flag).
FLAGS_UNDEFINED_IGNORE, FLAGS_MBUNDEFINED_IGNORE
    Set the INFO_UNDEFINED or INFO_MBUNDEFINED flag, and continue with the
    conversion.
FLAGS_UNDEFINED_MAPTOPRIVATE
    Set the INFO_UNDEFINED flag, write U+F1xx into the output buffer (where
    xx is the single-byte undefined code), and continue with the conversion.
FLAGS_UNDEFINED_0
    Set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII NUL
    character (0x00) into the output buffer, and continue with the
    conversion.
FLAGS_UNDEFINED_QUESTIONMARK
    Set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII “?”
    character (0x3F) into the output buffer, and continue with the
    conversion.
FLAGS_UNDEFINED_UNDERLINE
    Set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII “_”
    character (0x5F) into the output buffer, and continue with the
    conversion.
FLAGS_UNDEFINED_DEFAULT
    Set the INFO_UNDEFINED flag, write some output-encoding–specific
    character (currently U+FFFD for Unicode and “?” for all other
    encodings) into the output buffer, and continue with the conversion.

In the Unicode-to-text direction, the conversion functions also allow any
of the following extra flags (of which an arbitrary number can be specified).
In all cases, the usual checks for an exhausted output buffer are made; if
the buffer is not exhausted, the INFO_UNDEFINED flag is set.
FLAGS_UNDEFINED_REPLACE
    The undefined character is replaced with a similar character of the
    output encoding. For example, U+00A0 (NO-BREAK SPACE) could be mapped to
    0x20 (SPACE) in ASCII. Expect this to be poorly supported by the current
    implementation.
FLAGS_UNDEFINED_REPLACESTR
    The undefined character is replaced with a string of characters of the
    output encoding. For example, U+00A9 (COPYRIGHT SIGN) could be mapped to
    the three-character string “(C)” in ASCII. Expect this to be poorly
    supported by the current implementation.
FLAGS_PRIVATE_MAPTO0
    Private-use characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and
    U+100000–U+10FFFD) are mapped to an (appropriately encoded) ASCII NUL
    character (0x00) in the output buffer.
FLAGS_NONSPACING_IGNORE
    Non-spacing characters, such as U+200B (ZERO WIDTH SPACE) and U+FEFF
    (ZERO WIDTH NO-BREAK SPACE), are ignored. Expect some uncertainty in the
    current implementation as to which characters are affected.
FLAGS_CONTROL_IGNORE
    Control characters (U+0000–U+001F and U+007F–U+009F) are ignored.
FLAGS_PRIVATE_IGNORE
    Private-use characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and
    U+100000–U+10FFFD) are ignored.
There is also a FLAGS_NOCOMPOSITE flag, whose intended purpose I am not
sure of.
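How a mutually exclusive undefined-code behaviour combines with the extra flags of the Unicode-to-text direction can be sketched with a toy Unicode-to-ASCII converter. Everything named toy_ or TOY_ is hypothetical and is not the real rtl API. The sketch assumes that the extra flags are checked before the basic behaviour, and it leaves out output-buffer checks by requiring the destination to hold one character per input code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the RTL_UNICODETOTEXT_FLAGS_XXX constants.
   The basic behaviours are values of one masked field, so exactly one of
   them is selected; the extra flags are independent bits. */
enum {
    TOY_UNDEFINED_ERROR        = 0x00,
    TOY_UNDEFINED_IGNORE       = 0x01,
    TOY_UNDEFINED_QUESTIONMARK = 0x02,
    TOY_UNDEFINED_MASK         = 0x0F,
    TOY_CONTROL_IGNORE         = 0x10,
    TOY_PRIVATE_MAPTO0         = 0x20
};
enum { TOY_INFO_ERROR = 1u, TOY_INFO_UNDEFINED = 2u };

/* Private-use ranges as listed in the text above. */
static int toy_is_private(uint32_t c)
{
    return (c >= 0xE000 && c <= 0xF8FF) ||
           (c >= 0xF0000 && c <= 0xFFFFD) ||
           (c >= 0x100000 && c <= 0x10FFFD);
}

/* Convert code points to ASCII. dst must have room for n chars.
   Returns the number of chars written. */
static size_t toy_to_ascii(const uint32_t *src, size_t n, char *dst,
                           unsigned flags, unsigned *info)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL && info != NULL);
    *info = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t c = src[i];
        if (c <= 0x7F) {                     /* defined in ASCII */
            dst[out++] = (char)c;
            continue;
        }
        *info |= TOY_INFO_UNDEFINED;         /* c has no ASCII mapping */
        if ((flags & TOY_CONTROL_IGNORE) && c >= 0x80 && c <= 0x9F)
            continue;                        /* extra flag: drop C1 control */
        if ((flags & TOY_PRIVATE_MAPTO0) && toy_is_private(c)) {
            dst[out++] = '\0';               /* extra flag: private use -> NUL */
            continue;
        }
        switch (flags & TOY_UNDEFINED_MASK) {
        case TOY_UNDEFINED_IGNORE:
            break;
        case TOY_UNDEFINED_QUESTIONMARK:
            dst[out++] = '?';
            break;
        default:                             /* TOY_UNDEFINED_ERROR */
            *info |= TOY_INFO_ERROR;
            return out;                      /* quit immediately */
        }
    }
    return out;
}
```

For the input 'a', U+0085, U+E000, U+167F, 'b' with the question-mark behaviour and both extra flags, this sketch writes 'a', NUL, '?', 'b' and reports only the undefined flag; with the error behaviour alone it stops after 'a'.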
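An invalid code is a string of one or more units in the input buffer that is
not valid according to the input encoding. There are two categories:
1. A unit that can never be part of a valid string (for example, 0x80 in
   ASCII, or 0xFF in GB-18030).
2. A prefix of a valid string that is not followed by the units needed to
   complete it (for example, 0xD800 in Unicode, with a following
   low-surrogate missing, or 0xA1 in EUC-CN, with a second byte in the range
   0xA1–0xFE missing).

Invalid codes of the second category (that are potentially prefixes of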
valid strings) are handled specially at the end of the input buffer. If the
FLAGS_FLUSH flag is specified, they are handled like all other
invalid codes. Otherwise, the INFO_SRCBUFFERTOSMALL flag is set
to indicate that the input buffer possibly ended in the middle of an input
character (and the prefix is either not yet read, or is stored in the
conversion context, or is partly read and partly stored in the conversion
context).
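The end-of-buffer rule can be made concrete with a small scanner over UTF-16 units. This is only a sketch under simplifying assumptions: toy_scan_utf16() and the TOY_ constants are hypothetical, the function counts code points instead of converting them, and it always leaves a trailing prefix unread (one of the possibilities the text allows).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real flag and info bits. */
enum { TOY_FLAG_FLUSH = 1u };
enum { TOY_INFO_INVALID = 1u, TOY_INFO_SRCBUFFERTOSMALL = 2u };

/* Scan UTF-16 units, count the well-formed code points in *count, and
   return the number of units consumed. A high surrogate in the very last
   unit is a possible prefix of a valid pair: it is left unread unless
   TOY_FLAG_FLUSH is given, in which case it is an ordinary invalid code. */
static size_t toy_scan_utf16(const uint16_t *src, size_t n, unsigned flags,
                             size_t *count, unsigned *info)
{
    size_t i = 0;
    assert(src != NULL && count != NULL && info != NULL);
    *count = 0;
    *info = 0;
    while (i < n) {
        uint16_t u = src[i];
        int high = u >= 0xD800 && u <= 0xDBFF;
        int low = u >= 0xDC00 && u <= 0xDFFF;
        if (high && i + 1 == n && !(flags & TOY_FLAG_FLUSH)) {
            *info |= TOY_INFO_SRCBUFFERTOSMALL;  /* may continue in next chunk */
            break;
        }
        if (high && i + 1 < n && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            ++*count;                            /* valid surrogate pair */
            i += 2;
        } else if (high || low) {
            *info |= TOY_INFO_INVALID;           /* lone surrogate: skip it */
            ++i;
        } else {
            ++*count;                            /* ordinary BMP character */
            ++i;
        }
    }
    return i;
}
```

A caller that reads its input in chunks would pass no flush flag for all chunks but the last, and prepend the unread units to the next chunk whenever the source-buffer-too-small flag comes back.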
When encountering an invalid code (other than the special cases at the end of
the input buffer), the conversion functions allow any of the following
behaviours (which are mutually exclusive):
FLAGS_INVALID_ERROR
    Set the INFO_INVALID and the INFO_ERROR flags, and immediately quit the
    conversion (ignoring any FLAGS_FLUSH flag).
FLAGS_INVALID_IGNORE
    Set the INFO_INVALID flag, and continue with the conversion.
FLAGS_INVALID_0
    Set the INFO_INVALID flag, write an (appropriately encoded) ASCII NUL
    character (0x00) into the output buffer, and continue with the
    conversion.
FLAGS_INVALID_QUESTIONMARK
    Set the INFO_INVALID flag, write an (appropriately encoded) ASCII “?”
    character (0x3F) into the output buffer, and continue with the
    conversion.
FLAGS_INVALID_UNDERLINE
    Set the INFO_INVALID flag, write an (appropriately encoded) ASCII “_”
    character (0x5F) into the output buffer, and continue with the
    conversion.
FLAGS_INVALID_DEFAULT
    Set the INFO_INVALID flag, write some output-encoding–specific character
    (currently U+FFFD for Unicode and “?” for all other encodings) into the
    output buffer, and continue with the conversion.

If, in the course of conversion, there is not enough space left in the
output buffer (either for a normal character mapping or for a special mapping
of undefined or invalid codes), the INFO_DESTBUFFERTOSMALL flag
is set, and the conversion is quit immediately (ignoring any
FLAGS_FLUSH flag). It is unspecified whether the input units
that would overflow the output buffer are already read (and stored in the
conversion context) or not, but the number of processed input buffer units
returned by the conversion function will be correct in either case.
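Because the number of processed input units is reliable, a caller can recover from an exhausted output buffer by resuming at that offset. The following sketch shows such a loop with a toy ISO 8859-1 to UTF-16 converter (a plain widening of each byte). The functions toy_latin1_to_utf16() and toy_convert_all() and the TOY_ constant are hypothetical and only mimic the reporting convention described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real info bit. */
enum { TOY_INFO_DESTBUFFERTOSMALL = 1u };

/* Widen ISO 8859-1 bytes to UTF-16 units. Stops as soon as dst is full,
   sets TOY_INFO_DESTBUFFERTOSMALL, and reports in *src_used how many input
   bytes were processed. Returns the number of units written. */
static size_t toy_latin1_to_utf16(const unsigned char *src, size_t n,
                                  uint16_t *dst, size_t cap,
                                  unsigned *info, size_t *src_used)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL && info != NULL && src_used != NULL);
    *info = 0;
    for (; i < n; ++i) {
        if (i == cap) {                      /* no room for the next unit */
            *info |= TOY_INFO_DESTBUFFERTOSMALL;
            break;
        }
        dst[i] = src[i];
    }
    *src_used = i;
    return i;
}

/* Convert all of src through a small scratch buffer, resuming after every
   TOY_INFO_DESTBUFFERTOSMALL at the reported input offset. Output beyond
   out_cap is silently dropped in this sketch. Returns the units stored. */
static size_t toy_convert_all(const unsigned char *src, size_t n,
                              uint16_t *out, size_t out_cap)
{
    uint16_t scratch[4];
    size_t total = 0, pos = 0;
    unsigned info;
    do {
        size_t used, got, k;
        got = toy_latin1_to_utf16(src + pos, n - pos, scratch, 4, &info, &used);
        for (k = 0; k < got && total < out_cap; ++k)
            out[total++] = scratch[k];
        pos += used;                         /* resume where the call stopped */
    } while (info & TOY_INFO_DESTBUFFERTOSMALL);
    return total;
}
```

The loop terminates because every call with a non-empty scratch buffer processes at least one input byte before it can report an overflow.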
Author: Stephan Bergmann (Last modification $Date: 2001/10/16 15:55:27 $). Copyright 2001 OpenOffice.org Foundation. All Rights Reserved.