OSEC

From: Glynn Clements (glynn.clementsvirgin.net)
Date: Thu Jun 07 2001 - 22:51:57 CDT

    Eric Hacker wrote:

    > Unicode is a superset of ASCII and thus all ASCII characters are Unicode.
    > UTF8 is a way of encoding unicode code points for transport over the
    > internet in a restricted character set. Conveniently, UTF8 uses the same
    > values as ASCII for ASCII representation. Above the standard ASCII 127
    > character representation, UTF8 uses multi-byte strings beginning with 0xC1.

    No; the sequences for codes 128 to 255 begin with 0xC2 and 0xC3
    (128-191 and 192-255 respectively). 0xC0 and 0xC1 indicate (illegal)
    overlong encodings of 0-63 and 64-127 respectively.

    In general, the two-byte sequences have the (binary) form:

       110xxxxx 10xxxxxx
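
    As a minimal sketch of that bit layout (the function name
    encode_two_byte is made up for illustration, and Python is only
    used here as a convenient notation), the two bytes can be built
    from a code point like this:

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in 128..2047 using the form 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        # Below 128 the single-byte form is required; above 2047 the
        # value no longer fits into 5 + 6 payload bits.
        raise ValueError("two-byte form covers only 128..2047")
    lead = 0b11000000 | (code_point >> 6)           # 110xxxxx: upper 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: lower 6 bits
    return bytes([lead, trail])


if __name__ == "__main__":
    for cp in (128, 191, 192, 255):
        encoded = encode_two_byte(cp)
        # Cross-check against Python's built-in UTF-8 encoder.
        assert encoded == chr(cp).encode("utf-8")
        print(cp, encoded.hex())
```

    For 128-191 the lead byte comes out as 0xC2, and for 192-255 as
    0xC3, which matches the ranges given above.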

    Applied to the range 0-127 (which must use the single-byte form
    instead), this form would give:

       1100000x 10xxxxxx

    Hence, any sequence beginning with 11000000 (0xC0) or 11000001 (0xC1)
    is illegal.
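
    A rough way to see this in practice, assuming a strict decoder
    such as the one in Python 3 (the helper names naive_two_byte_value
    and strict_decoder_rejects are hypothetical, chosen only for this
    sketch):

```python
def naive_two_byte_value(data: bytes) -> int:
    """Extract the payload bits of 110xxxxx 10xxxxxx without any validity checks."""
    lead, trail = data
    return ((lead & 0b00011111) << 6) | (trail & 0b00111111)


def strict_decoder_rejects(data: bytes) -> bool:
    """Return True if Python's built-in UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    # A decoder that skips validation maps these sequences onto 0..127,
    # duplicating the single-byte forms.
    assert naive_two_byte_value(b"\xc0\x80") == 0
    assert naive_two_byte_value(b"\xc1\xbf") == 127

    # A strict decoder treats 0xC0 and 0xC1 as invalid lead bytes.
    assert strict_decoder_rejects(b"\xc0\x80")
    assert strict_decoder_rejects(b"\xc1\xbf")

    # 0xC2 is the first legal two-byte lead byte.
    assert not strict_decoder_rejects(b"\xc2\x80")
```

    The naive extraction shows why these sequences are a problem: a
    lenient decoder would accept a second spelling of plain ASCII
    characters.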

    -- 
    Glynn Clements <glynn.clementsvirgin.net>