Unicode

Characters are not bytes and bytes are not characters (not any more). The Python string type supports Unicode, which defines a numeric value (a code point) for each of a very large set of characters.

ASCII is a subset of Unicode: the first 128 code points (0 through 127).

In [2]:
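# Printable ASCII characters run from 32 to 126; 127 (0x7f) is the DEL control character
for i in range(32, 128):
    # indexing the string by i % 8 ends every eighth item with a newline, the others with a tab
    print(hex(i), chr(i), end='\t\t\t\t\t\t\t\n'[i % 8])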
0x20  	0x21 !	0x22 "	0x23 #	0x24 $	0x25 %	0x26 &	0x27 '
0x28 (	0x29 )	0x2a *	0x2b +	0x2c ,	0x2d -	0x2e .	0x2f /
0x30 0	0x31 1	0x32 2	0x33 3	0x34 4	0x35 5	0x36 6	0x37 7
0x38 8	0x39 9	0x3a :	0x3b ;	0x3c <	0x3d =	0x3e >	0x3f ?
0x40 @	0x41 A	0x42 B	0x43 C	0x44 D	0x45 E	0x46 F	0x47 G
0x48 H	0x49 I	0x4a J	0x4b K	0x4c L	0x4d M	0x4e N	0x4f O
0x50 P	0x51 Q	0x52 R	0x53 S	0x54 T	0x55 U	0x56 V	0x57 W
0x58 X	0x59 Y	0x5a Z	0x5b [	0x5c \	0x5d ]	0x5e ^	0x5f _
0x60 `	0x61 a	0x62 b	0x63 c	0x64 d	0x65 e	0x66 f	0x67 g
0x68 h	0x69 i	0x6a j	0x6b k	0x6c l	0x6d m	0x6e n	0x6f o
0x70 p	0x71 q	0x72 r	0x73 s	0x74 t	0x75 u	0x76 v	0x77 w
0x78 x	0x79 y	0x7a z	0x7b {	0x7c |	0x7d }	0x7e ~	0x7f 

The ord() function returns the Unicode value (the ordinal value, or Unicode "code point") for a character. The chr() function does the reverse: it returns the character for a Unicode value.

In [2]:
c='䆞'
print(c,hex(ord(c)))
# or the character for the ordinal 
print(0x2551,chr(0x2551))
䆞 0x419e
9553 ║
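
A quick check that ord() and chr() really are inverses of each other. This is only a sketch; the sample characters are arbitrary and the variable names are made up for the example:

```python
# ord() and chr() are inverses: chr(ord(c)) gives back c
samples = ['A', 'Ö', '║', '䆞']      # arbitrary sample characters
for c in samples:
    code_point = ord(c)             # character -> integer code point
    assert chr(code_point) == c     # integer code point -> same character
    print(c, code_point, hex(code_point))
```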

Most characters have values above 127. Here are 64 consecutive characters starting at a randomly chosen code point:

In [20]:
from random import randint
start=randint(128,20000)
for i in range(start,start+64):
    print(hex(i),chr(i),end=' .. ')
0x21b7 ↷ .. 0x21b8 ↸ .. 0x21b9 ↹ .. 0x21ba ↺ .. 0x21bb ↻ .. 0x21bc ↼ .. 0x21bd ↽ .. 0x21be ↾ .. 0x21bf ↿ .. 0x21c0 ⇀ .. 0x21c1 ⇁ .. 0x21c2 ⇂ .. 0x21c3 ⇃ .. 0x21c4 ⇄ .. 0x21c5 ⇅ .. 0x21c6 ⇆ .. 0x21c7 ⇇ .. 0x21c8 ⇈ .. 0x21c9 ⇉ .. 0x21ca ⇊ .. 0x21cb ⇋ .. 0x21cc ⇌ .. 0x21cd ⇍ .. 0x21ce ⇎ .. 0x21cf ⇏ .. 0x21d0 ⇐ .. 0x21d1 ⇑ .. 0x21d2 ⇒ .. 0x21d3 ⇓ .. 0x21d4 ⇔ .. 0x21d5 ⇕ .. 0x21d6 ⇖ .. 0x21d7 ⇗ .. 0x21d8 ⇘ .. 0x21d9 ⇙ .. 0x21da ⇚ .. 0x21db ⇛ .. 0x21dc ⇜ .. 0x21dd ⇝ .. 0x21de ⇞ .. 0x21df ⇟ .. 0x21e0 ⇠ .. 0x21e1 ⇡ .. 0x21e2 ⇢ .. 0x21e3 ⇣ .. 0x21e4 ⇤ .. 0x21e5 ⇥ .. 0x21e6 ⇦ .. 0x21e7 ⇧ .. 0x21e8 ⇨ .. 0x21e9 ⇩ .. 0x21ea ⇪ .. 0x21eb ⇫ .. 0x21ec ⇬ .. 0x21ed ⇭ .. 0x21ee ⇮ .. 0x21ef ⇯ .. 0x21f0 ⇰ .. 0x21f1 ⇱ .. 0x21f2 ⇲ .. 0x21f3 ⇳ .. 0x21f4 ⇴ .. 0x21f5 ⇵ .. 0x21f6 ⇶ .. 

str.encode() converts a string to bytes, and bytes.decode() converts bytes back to a string.

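Different encodings require a different number of bytes. Python's 'utf-16' and 'utf-32' codecs also prepend a byte order mark (BOM), which is counted in the lengths below: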

In [5]:
c=chr(0x401)
print(c,len(c),len(c.encode('utf-8')), len(c.encode('utf-16')), len(c.encode('utf-32')))
#print(type(c),type(c.encode('utf-8')),type(c.encode('utf-16')),type(c.encode('utf-32')))
Ё 1 2 4 8
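
The round trip through bytes can be sketched as follows, using the same character. The endian-specific codecs 'utf-16-le' and 'utf-32-le' write no byte order mark, so they show the size of the character data alone. The variable names are illustrative only:

```python
c = chr(0x401)                       # same character as in the cell above

# str.encode() gives bytes; bytes.decode() gives back the str
raw = c.encode('utf-8')
print(type(raw), raw)
assert raw.decode('utf-8') == c

# compare the codecs with and without a byte order mark
for name in ('utf-16', 'utf-16-le', 'utf-32', 'utf-32-le'):
    print(name, len(c.encode(name)))
```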

Here's an example of a string whose elements are each a single character but require a different number of bytes to encode:

In [6]:
s='RÖ猫𐒎'
print(len(s))
for c in s:
    print(c,hex(ord(c)),len(c),len(c.encode('utf-8')), len(c.encode('utf-16')))
4
R 0x52 1 1 4
Ö 0xd6 1 2 4
猫 0x732b 1 3 4
𐒎 0x1048e 1 4 6
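
The last character lies above U+FFFF, so UTF-16 needs two 16-bit code units (a surrogate pair) for it, which, together with the 2-byte byte order mark, accounts for the 6 bytes reported above. A minimal sketch that counts UTF-16 code units without the byte order mark; the helper name utf16_units is made up for this example:

```python
def utf16_units(ch):
    """Return how many 16-bit code units UTF-16 needs for ch (BOM excluded)."""
    # 'utf-16-le' writes no byte order mark, and each code unit is 2 bytes
    return len(ch.encode('utf-16-le')) // 2

for ch in 'RÖ猫𐒎':
    print(ch, hex(ord(ch)), utf16_units(ch))
```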