DIConverters
Overview
DIConverters supplies 144 character set encodings with two complementary functions each, adding up to a total of more than 288 character conversion functions:
- to decode from encoding to Unicode.
- to encode from Unicode to encoding.
All conversions are fully native and require no DLLs or other system dependencies. Applications built with DIConverters therefore run on all Win32 platforms, starting with (and including!) Windows 95.
The converter functions allow for smart-linking: only those functions used by the application are actually included in the executable. This keeps applications small when only one or a few character conversions are needed.
Click here for a listing of the character sets and encodings supported by DIConverters.
Using DIConverters
All conversions take place on a Unicode character base. In multi-byte character encodings, a single Unicode character is represented by one or more bytes.
DIConverters can be used by DIUnicode, which provides convenient classes with automatic character conversion for both reading and writing Unicode text. With DIUnicode, all text operations take place on a WideChar / WideString basis regardless of the actual text encoding. This allows applications to use the very same import / export routine for all 144 character sets and encodings.
Conversion Preparations
Functions for both direct decoding and encoding require a conversion state variable of type conv_t, which is a record structure defined in DIConverters.pas. Before starting a direct character conversion, this variable must be initialized with zeros. Applications can easily accomplish this with the following standard Pascal call:
var
  conv: conv_struct;
begin
  FillChar(conv, SizeOf(conv), 0);
You can then proceed using the decoding and encoding functions described below.
Reading with Unicode Decoding
The function prototype to decode multi-byte encodings to Unicode is:
xxx_mbtowc = function( const conv: conv_t; var pwc: ucs4_t; const s: Pointer; const n: Integer): Integer;
The xxx stands for the actual character encoding, like utf8_mbtowc.
It converts the byte sequence starting at s to a Unicode code point. Up to n bytes are available at s, and n must be >= 1. The resulting Unicode code point is stored in pwc.
The function's return value indicates if the conversion was successful:
- number of bytes consumed: Success, a wide character was read.
- -1: The byte sequence at s is invalid.
- -2: The number of bytes n is too small.
- -2-(number of bytes consumed): Only a shift sequence was read.
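The return codes above suggest a simple read loop. The following is a minimal sketch only, not code taken from DIConverters: the ucs4_t and conv_t declarations are stand-ins so that the program compiles on its own, and demo_mbtowc is a hypothetical decoder (a deliberately reduced UTF-8 subset) that merely follows the documented calling convention. In a real application, the types come from DIConverters.pas and a function such as utf8_mbtowc takes the place of demo_mbtowc.

```pascal
program DecodeLoopSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

type
  // Stand-in declarations (assumptions). The real ones live in DIConverters.pas
  // and may be defined differently.
  ucs4_t = Cardinal;
  conv_t = record
    istate, ostate: Cardinal;
  end;

// Hypothetical decoder that follows the xxx_mbtowc convention.
// It understands 1- and 2-byte UTF-8 sequences only.
function demo_mbtowc(const conv: conv_t; var pwc: ucs4_t;
  const s: Pointer; const n: Integer): Integer;
var
  p: PByte;
  b0, b1: Byte;
begin
  p := PByte(s);
  b0 := p^;
  if b0 < $80 then
  begin
    pwc := b0;
    Result := 1;
  end
  else if (b0 >= $C2) and (b0 <= $DF) then
  begin
    if n < 2 then
    begin
      Result := -2; // the trail byte is not available yet
      Exit;
    end;
    Inc(p);
    b1 := p^;
    if (b1 and $C0) <> $80 then
      Result := -1 // invalid trail byte
    else
    begin
      pwc := (ucs4_t(b0 and $1F) shl 6) or (b1 and $3F);
      Result := 2;
    end;
  end
  else
    Result := -1; // outside the subset handled by this sketch
end;

const
  // 'A', U+00E9 (encoded as C3 A9), 'z'
  Data: array[0..3] of Byte = ($41, $C3, $A9, $7A);

var
  conv: conv_t;
  wc: ucs4_t;
  pos, res: Integer;
begin
  FillChar(conv, SizeOf(conv), 0); // zero the state before the first call
  pos := 0;
  while pos < Length(Data) do
  begin
    res := demo_mbtowc(conv, wc, @Data[pos], Length(Data) - pos);
    if res > 0 then
    begin
      WriteLn('code point ', wc); // success: res bytes were consumed
      Inc(pos, res);
    end
    else if res = -1 then
    begin
      WriteLn('invalid byte sequence at offset ', pos);
      Inc(pos); // one possible policy: skip a single byte and continue
    end
    else if res = -2 then
    begin
      WriteLn('incomplete sequence at offset ', pos);
      Break; // more input is needed before decoding can continue
    end
    else if res < -2 then
      Inc(pos, -2 - res) // only a shift sequence was read: skip it
    else
      Break; // 0 is not a documented result; stop defensively
  end;
end.
```

Skipping one byte after a -1 result is only one of several reasonable error policies; an application might instead abort or substitute a replacement character.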
A few encodings may require xxx_mbtowc to be combined with xxx_flushwc:
xxx_flushwc = function( const conv: conv_t; var pwc: ucs4_t): Integer;
xxx_flushwc returns to the initial state and stores the pending wide character, if any, in pwc. The result is 1 if a wide character was stored, or 0 if none was pending.
Calling xxx_flushwc is not required for most encodings.
Writing with Unicode Encoding
The function prototype to encode a Unicode code point to multi-byte is:
xxx_wctomb = function( const conv: conv_t; const r: Pointer; const wc: ucs4_t; const n: Integer): Integer;
The xxx stands for the actual character encoding, like utf8_wctomb.
The function converts the wide character wc to the character set xxx, and stores the result beginning at r. Up to n bytes may be written at r. n is >= 1.
The function's result is the number of bytes written, or -1 if invalid, or -2 if n is too small.
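A write loop built on these result codes could look like the sketch below. It rests on assumptions: ucs4_t and conv_t are stand-in declarations so that the program compiles on its own, and demo_wctomb is a hypothetical encoder (UTF-8, limited to code points below U+0800) that only imitates the documented calling convention. In a real application, the types come from DIConverters.pas and a function such as utf8_wctomb is called instead.

```pascal
program EncodeLoopSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

type
  // Stand-in declarations (assumptions). The real ones live in DIConverters.pas
  // and may be defined differently.
  ucs4_t = Cardinal;
  conv_t = record
    istate, ostate: Cardinal;
  end;

// Hypothetical encoder that follows the xxx_wctomb convention.
// It writes UTF-8 for code points below U+0800 only.
function demo_wctomb(const conv: conv_t; const r: Pointer;
  const wc: ucs4_t; const n: Integer): Integer;
var
  p: PByte;
begin
  p := PByte(r);
  if wc < $80 then
  begin
    p^ := Byte(wc);
    Result := 1;
  end
  else if wc < $800 then
  begin
    if n < 2 then
    begin
      Result := -2; // not enough room for two bytes
      Exit;
    end;
    p^ := Byte($C0 or (wc shr 6));
    Inc(p);
    p^ := Byte($80 or (wc and $3F));
    Result := 2;
  end
  else
    Result := -1; // outside the subset handled by this sketch
end;

const
  // 'A', U+00E9, 'z'
  Chars: array[0..2] of ucs4_t = ($41, $E9, $7A);

var
  conv: conv_t;
  buf: array[0..7] of Byte;
  used, i, res: Integer;
begin
  FillChar(conv, SizeOf(conv), 0); // zero the state before the first call
  used := 0;
  for i := Low(Chars) to High(Chars) do
  begin
    if used >= Length(buf) then
      Break; // n must be >= 1, so stop when the buffer is full
    res := demo_wctomb(conv, @buf[used], Chars[i], Length(buf) - used);
    if res >= 0 then
      Inc(used, res) // success: res bytes were written
    else if res = -1 then
      WriteLn('cannot encode character at index ', i)
    else
    begin
      WriteLn('output buffer too small at index ', i);
      Break; // res = -2: flush or enlarge the buffer, then retry
    end;
  end;
  WriteLn('bytes written: ', used);
end.
```

The fixed 8-byte buffer keeps the sketch short; a real application would typically flush the buffer to its destination and retry the same character after a -2 result.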
To write any pending characters and return to the original state, a call to xxx_reset may be required for some encodings:
xxx_reset = function( const conv: conv_t; const r: Pointer; const n: Integer): Integer;
It stores a shift sequence returning to the initial state beginning at r. Up to n bytes may be written at r. n is >= 0. It returns the number of bytes written, or -2 if n is too small.