Unicode - From UCS-2 to UTF-16

    From UCS-2 to UTF-16
    Discussion and practical example for the transition of a Unicode library from UCS-2 to UTF-16.

    Why is this an issue?
    The concept of the Unicode standard changed during its first few years. Unicode 2.0 (1996) expanded the code point range from 64k to 1.1M. APIs and libraries need to follow this change and support the full range. Upcoming character assignments (Unicode 3.1, 2001) fall into the added range.

    "Unicode is a 16-bit character set"
    Concept: a 16-bit, fixed-width character set. Saving space by not including precomposed, rarely used, or obsolete characters. Compatibility, transition strategies, and acceptance forced a loosening of these principles. Unicode 3.1: over 90k assigned characters.

    16-bit APIs
    APIs developed for Unicode 1.1 used 16-bit characters and strings (UCS-2), assuming a 1:1 correspondence between characters and code units. Examples: Win32, Java, COM, ICU, Qt/KDE. Byte-based UTF-8 (1993) was used mostly for MBCS compatibility and transfer protocols.

    Extending the range
    Set aside two blocks of 1k 16-bit values ("surrogates") for extension. 1k x 1k = 1M = 0x100000 additional code points, each using a pair of code units. The 16-bit form is now variable-width: UTF-16. "Unicode scalar values": 0..0x10ffff. Proposed in 1994; part of Unicode 2.0 (1996).
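
    The pair arithmetic described above can be sketched in C as follows. The function names are hypothetical and not taken from any library; the block boundaries (lead surrogates 0xd800..0xdbff, trail surrogates 0xdc00..0xdfff) are the standard UTF-16 values.

```c
#include <stdint.h>

/* Hypothetical helper: encode one Unicode scalar value as UTF-16.
 * Writes 1 or 2 code units to out and returns that count,
 * or 0 if cp is a surrogate code point or lies above 0x10ffff. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff) {
            return 0;                   /* surrogates are not scalar values */
        }
        out[0] = (uint16_t)cp;          /* BMP: one code unit */
        return 1;
    }
    if (cp > 0x10ffff) {
        return 0;                       /* outside the UTF-16 range */
    }
    cp -= 0x10000;                      /* 20 bits remain: 1k x 1k */
    out[0] = (uint16_t)(0xd800 + (cp >> 10));   /* lead: upper 10 bits */
    out[1] = (uint16_t)(0xdc00 + (cp & 0x3ff)); /* trail: lower 10 bits */
    return 2;
}

/* Hypothetical helper: combine a valid lead/trail pair into a scalar value.
 * The caller is assumed to have checked that both units are in range. */
uint32_t utf16_combine(uint16_t lead, uint16_t trail)
{
    return 0x10000
         + (((uint32_t)lead - 0xd800) << 10)
         + ((uint32_t)trail - 0xdc00);
}
```

    Each of the two blocks contributes 10 bits, which is where the 1k x 1k = 0x100000 additional code points come from.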

    Parallel with ISO 10646
    ISO 10646 uses 31-bit codes: UCS-4. UCS-2: 16-bit codes for the subset 0..0xffff. UTF-16: transformation of the subset 0..0x10ffff. UTF-8 covers all 31 bits. Private Use areas above 0x10ffff are slated for removal from ISO 10646, for UTF interoperability and synchronization with Unicode.

    21-bit code points
    Code points ("Unicode scalar values") up to 0x10ffff use 21 bits. 16-bit code units are still good for strings: variable-width, like MBCS. The default string unit size is not big enough for code points. Dual types for programming?

    C: char/wchar_t dual types
    C/C++ standards: dual types. Strings: mostly with char units (8 bits). Code points: wchar_t, 8 to 32 bits. Typical use in I18N-ed programs: 8-bit char strings but 16/32-bit wchar_t (or 32-bit int) characters. The code point type is implementation-dependent.

    Unicode: dual types, too?
    Strings could continue with 16-bit units. Single code points could get a 32-bit data type. This is a dual-type model like C/C++ MBCS.

    Alternatives to dual 16/32 types
    UTF-32: all types 32 bits wide, fixed-width. UTF-8: same complexity after the range extension beyond just the BMP; closer to the C/C++ model (byte-based). Use pairs of 16-bit units. Use strings for everything. Make the string unit size flexible: 8/16/32 bits.

    UCS-2 to UTF-32
    Fixed-width, single base type for strings and code points. UCS-2 programming assumptions stay mostly intact. Wastes at least 33% of the space, typically 50%. Performance bottleneck between CPU and memory.

    UCS-2 to UTF-8
    UCS-2 programming assumes many characters in single code units; moving to UTF-8 breaks a lot of code. Same question of a type for code points: follow the C model with a 32-bit wchar_t? A more difficult transition than the other choices.

    Surrogate pairs for single chars
    The caller avoids code point calculation. But: caller and callee need to detect and handle pairs, the caller choosing argument values and the callee checking for errors. Harder to use with code point constants because they are published as scalar values. Significant change for the caller from using scalars.

    Strings for single chars
    Always pass in a string (and an offset). Most general: handles graphemes in addition to code points. Harder to use with code point constants because they are published as scalar values. Significant change for the caller from using scalars.
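
    To illustrate the "string and offset" calling style, here is a minimal sketch in C. The function name and its handling of unpaired surrogates (returned unchanged) are assumptions made for this example, not an API from the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the code point that contains the code unit
 * at s[offset]. The offset may point at either half of a surrogate pair.
 * Unpaired surrogates are returned as-is (an assumption of this sketch).
 * The caller must ensure offset < length. */
uint32_t utf16_char_at(const uint16_t *s, size_t length, size_t offset)
{
    uint16_t c = s[offset];

    if (c >= 0xd800 && c <= 0xdbff) {
        /* Lead surrogate: look ahead for the trail. */
        if (offset + 1 < length) {
            uint16_t t = s[offset + 1];
            if (t >= 0xdc00 && t <= 0xdfff) {
                return 0x10000 + (((uint32_t)c - 0xd800) << 10)
                               + ((uint32_t)t - 0xdc00);
            }
        }
    } else if (c >= 0xdc00 && c <= 0xdfff) {
        /* Trail surrogate: look back for the lead. */
        if (offset > 0) {
            uint16_t l = s[offset - 1];
            if (l >= 0xd800 && l <= 0xdbff) {
                return 0x10000 + (((uint32_t)l - 0xd800) << 10)
                               + ((uint32_t)c - 0xdc00);
            }
        }
    }
    return c;   /* BMP code unit or unpaired surrogate */
}
```

    The callee does all pair detection here, so the caller never computes a code point itself; the cost is that a scalar constant has to be turned into a string before it can be passed in.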

    UTF-flexible
    In principle, if the implementation can handle variable-width, MBCS-style strings, could it handle any UTF size as a compile-time choice? Adds interoperability with UTF-8/32 APIs. Almost no assumptions possible. Complexity of the transition even higher than that of a transition to pure UTF-8; performance?

    Interoperability
    Break existing API users no more than necessary. Interoperability with other APIs: Win32, Java, COM, now also XML DOM. UTF-16 is the Unicode default: a good compromise (speed/ease/space). String units should stay 16 bits wide.

    Does everything need to change?
    String operations (search, substring, concatenation) work with any UTF without change. Character property lookup and similar operations need to support the extended range. Formatting should handle more code points or even graphemes. Careful evaluation of all public APIs is needed.

    ICU: some of all
    Strings: UTF-16; the UChar type remains 16-bit. New UChar32 type for code points. Provide macros for C to deal with all UTFs: iteration, random access. C++ CharacterIterator: many new functions. Property lookup/low-level: UChar32. Formatting: strings for graphemes.
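
    The dual-type model with forward iteration might look like the following C sketch. The type names Unit16 and CodePoint32 are hypothetical stand-ins for the 16-bit string unit and the 32-bit code point type, and the functions are illustrative helpers, not ICU's actual macros.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Unit16;       /* hypothetical: string code unit */
typedef int32_t  CodePoint32;  /* hypothetical: single code point */

/* Read the code point starting at s[*i] and advance *i past it.
 * An unpaired surrogate is returned as a single unit (sketch assumption).
 * The caller must ensure *i < length on entry. */
CodePoint32 utf16_next(const Unit16 *s, size_t length, size_t *i)
{
    CodePoint32 c = s[(*i)++];

    if (c >= 0xd800 && c <= 0xdbff && *i < length) {
        Unit16 t = s[*i];
        if (t >= 0xdc00 && t <= 0xdfff) {
            ++*i;   /* consume the trail surrogate as well */
            c = 0x10000 + ((c - 0xd800) << 10) + (t - 0xdc00);
        }
    }
    return c;
}

/* Count code points (not code units) in a UTF-16 string. */
size_t utf16_count_code_points(const Unit16 *s, size_t length)
{
    size_t i = 0, n = 0;

    while (i < length) {
        (void)utf16_next(s, length, &i);
        ++n;
    }
    return n;
}
```

    Strings stay arrays of 16-bit units, so existing storage and string APIs are untouched; only the code that looks at individual characters switches to the 32-bit type.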

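    Scalar code points: property lookup
    Old, 16-bit: UChar u_tolower(UChar c) { u[v[c15..7] + c6..0] }. New, 21-bit: UChar32 u_tolower(UChar32 c) { u[v[w[c20..10] + c9..4] + c3..0] }. Here cN..M denotes bits N down to M of c, and u, v, w are the stages of the lookup table.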
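
    The two index expressions can be written out in C roughly as below. This is a sketch under stated assumptions: the table layout (16-bit entries, v and w holding block start offsets) and the function names are hypothetical, and returning 0 for out-of-range input is a choice made only for this example.

```c
#include <stdint.h>

/* Two-stage lookup for 16-bit code units: u[v[c15..7] + c6..0].
 * v has 512 entries, each the start offset of a 128-entry block in u. */
uint16_t lookup16(const uint16_t *v, const uint16_t *u, uint16_t c)
{
    return u[v[c >> 7] + (c & 0x7f)];
}

/* Three-stage lookup for 21-bit code points:
 * u[v[w[c20..10] + c9..4] + c3..0].
 * w has 0x440 entries (0x110000 >> 10), each the start offset of a
 * 64-entry block in v; each v entry is the start offset of a
 * 16-entry block in u. */
uint16_t lookup21(const uint16_t *w, const uint16_t *v, const uint16_t *u,
                  uint32_t c)
{
    if (c > 0x10ffff) {
        return 0;   /* sketch assumption: no data outside the range */
    }
    return u[v[w[c >> 10] + ((c >> 4) & 0x3f)] + (c & 0xf)];
}
```

    Identical blocks can share one offset at every stage, which keeps the tables compact even though most of the 21-bit range is unassigned.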

    Formatting: grapheme strings
    Old: void setDecimalSymbol(UChar c); New: void setDecimalSymbol(const UnicodeString &s);

    Codepage conversion
    To Unicode: results are one or two UTF-16 code units; surrogates are stored directly in the conversion table. From Unicode: triple-stage compact array access from 21-bit code points, like property lookup. Single-character conversion to Unicode now returns UChar32 values.

    API first
    Tools and basic functions and classes are in place (property lookup, conversion, iterators, BiDi). Public APIs reviewed and changed ("luxury" of an early project stage) or deprecated and superseded by new versions. Higher-level implementations to follow before Unicode 3.1 is published.

    More implementations follow
    Collation: need to prepare for more than 64k primary keys. Normalization and Transliteration. Word/Sentence break iteration. Etc. No non-BMP data before Unicode 3.1 is stable.

    Other libraries
    Java: planning stage for the transition. Win32: rendering and UniScribe API largely UTF-16-ready. Linux: standardizing on 32-bit Unicode wchar_t; has UTF-8 locales, like other Unixes, for char APIs. W3C: standards assume the full UTF-16 range.

    Summary
    The transition from UCS-2 to UTF-16 gains importance after four years of the standard. APIs for single characters need change or new versions. String APIs: no change. Implementations need to handle 21-bit code points. There is a range of options.

    Resources
    Unicode FAQ: http://www.unicode.org/unicode/faq; Unicode on IBM developerWorks: http://www.ibm.com/developer/unicode; ICU: http://oss.software.ibm.com/icu