Unicode

Characters are not bytes and bytes are not characters (not any more). Python's string type (str) stores Unicode text. The Unicode standard assigns a numeric value, called a code point, to each of a very large set of characters.

ASCII is a subset of Unicode: it is the first 128 code points (0 through 127).

In [23]:
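# Printable ASCII runs from 32 (space) to 126 (~); 127 (DEL) is a control character
for i in range(32, 128):
    # end is a tab for the first seven entries of each row and a newline for the eighth
    print(hex(i), chr(i), end='\t\t\t\t\t\t\t\n'[i % 8])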
0x20  	0x21 !	0x22 "	0x23 #	0x24 $	0x25 %	0x26 &	0x27 '
0x28 (	0x29 )	0x2a *	0x2b +	0x2c ,	0x2d -	0x2e .	0x2f /
0x30 0	0x31 1	0x32 2	0x33 3	0x34 4	0x35 5	0x36 6	0x37 7
0x38 8	0x39 9	0x3a :	0x3b ;	0x3c <	0x3d =	0x3e >	0x3f ?
0x40 @	0x41 A	0x42 B	0x43 C	0x44 D	0x45 E	0x46 F	0x47 G
0x48 H	0x49 I	0x4a J	0x4b K	0x4c L	0x4d M	0x4e N	0x4f O
0x50 P	0x51 Q	0x52 R	0x53 S	0x54 T	0x55 U	0x56 V	0x57 W
0x58 X	0x59 Y	0x5a Z	0x5b [	0x5c \	0x5d ]	0x5e ^	0x5f _
0x60 `	0x61 a	0x62 b	0x63 c	0x64 d	0x65 e	0x66 f	0x67 g
0x68 h	0x69 i	0x6a j	0x6b k	0x6c l	0x6d m	0x6e n	0x6f o
0x70 p	0x71 q	0x72 r	0x73 s	0x74 t	0x75 u	0x76 v	0x77 w
0x78 x	0x79 y	0x7a z	0x7b {	0x7c |	0x7d }	0x7e ~	0x7f 

The ord() function returns the Unicode value (the ordinal value, or Unicode "code point") for a character. The chr() function does the reverse: it returns the character for a Unicode value.

In [24]:
c='਱'
print(c,hex(ord(c)))
# or the character for the ordinal 
print(0x2551,chr(0x2551))
਱ 0xa31
9553 ║
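
As a quick check that ord() and chr() are inverses, here is a minimal sketch. The names samples, code_points and restored are only illustrative. It also shows that chr() rejects values beyond 0x10FFFF, the highest Unicode code point.

```python
# Round-trip a few characters through ord() and chr().
samples = ['A', 'Ö', '火']
code_points = [ord(ch) for ch in samples]   # characters -> integers
restored = [chr(cp) for cp in code_points]  # integers -> characters
print([hex(cp) for cp in code_points])
print(restored == samples)

# chr() only accepts code points in range(0x110000).
try:
    chr(0x110000)
except ValueError as err:
    print('out of range:', err)
```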

Most characters have code points of 128 or above. Here are 64 consecutive characters starting from a randomly chosen code point:

In [31]:
from random import randint
start=randint(128,20000)
for i in range(start,start+64):
    print(hex(i),chr(i),end=' .. ')
0x4198 䆘 .. 0x4199 䆙 .. 0x419a 䆚 .. 0x419b 䆛 .. 0x419c 䆜 .. 0x419d 䆝 .. 0x419e 䆞 .. 0x419f 䆟 .. 0x41a0 䆠 .. 0x41a1 䆡 .. 0x41a2 䆢 .. 0x41a3 䆣 .. 0x41a4 䆤 .. 0x41a5 䆥 .. 0x41a6 䆦 .. 0x41a7 䆧 .. 0x41a8 䆨 .. 0x41a9 䆩 .. 0x41aa 䆪 .. 0x41ab 䆫 .. 0x41ac 䆬 .. 0x41ad 䆭 .. 0x41ae 䆮 .. 0x41af 䆯 .. 0x41b0 䆰 .. 0x41b1 䆱 .. 0x41b2 䆲 .. 0x41b3 䆳 .. 0x41b4 䆴 .. 0x41b5 䆵 .. 0x41b6 䆶 .. 0x41b7 䆷 .. 0x41b8 䆸 .. 0x41b9 䆹 .. 0x41ba 䆺 .. 0x41bb 䆻 .. 0x41bc 䆼 .. 0x41bd 䆽 .. 0x41be 䆾 .. 0x41bf 䆿 .. 0x41c0 䇀 .. 0x41c1 䇁 .. 0x41c2 䇂 .. 0x41c3 䇃 .. 0x41c4 䇄 .. 0x41c5 䇅 .. 0x41c6 䇆 .. 0x41c7 䇇 .. 0x41c8 䇈 .. 0x41c9 䇉 .. 0x41ca 䇊 .. 0x41cb 䇋 .. 0x41cc 䇌 .. 0x41cd 䇍 .. 0x41ce 䇎 .. 0x41cf 䇏 .. 0x41d0 䇐 .. 0x41d1 䇑 .. 0x41d2 䇒 .. 0x41d3 䇓 .. 0x41d4 䇔 .. 0x41d5 䇕 .. 0x41d6 䇖 .. 0x41d7 䇗 .. 

str.encode() converts a string to bytes, and bytes.decode() converts bytes back to a string.
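
A minimal round trip, assuming Python 3, where encode() is called on the str and decode() on the resulting bytes object. The names text, data and back are only illustrative.

```python
text = 'RÖ猫'
data = text.encode('utf-8')   # str -> bytes
print(type(data), data)
back = data.decode('utf-8')   # bytes -> str
print(type(back), back == text)
```

Decoding with a different encoding than the one used for encoding will generally either raise UnicodeDecodeError or silently produce the wrong characters, so the encoding name has to travel with the bytes.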

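Different encodings require a different number of bytes. Note that Python's 'utf-16' and 'utf-32' codecs prepend a byte order mark (2 and 4 bytes respectively), which is included in the lengths below: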

In [26]:
c=chr(0x706b)
print(c,len(c),len(c.encode('utf-8')), len(c.encode('utf-16')), len(c.encode('utf-32')))
print(type(c),type(c.encode('utf-8')),type(c.encode('utf-16')),type(c.encode('utf-32')))
火 1 3 4 8
<class 'str'> <class 'bytes'> <class 'bytes'> <class 'bytes'>
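
The 4 and 8 reported for utf-16 and utf-32 include a byte order mark (BOM) that these codecs write before the character itself. The sketch below compares them with the explicit little-endian variants, which write no BOM. The dictionary name sizes is only illustrative, and the hex output of the BOM-carrying codecs depends on the machine's native byte order.

```python
c = chr(0x706b)
sizes = {}
for name in ('utf-16', 'utf-16-le', 'utf-32', 'utf-32-le'):
    encoded = c.encode(name)
    sizes[name] = len(encoded)
    # hex() makes the leading BOM bytes visible for 'utf-16' and 'utf-32'
    print(name, len(encoded), encoded.hex())
```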

Here's an example of characters that each have length 1 as a string but require different numbers of bytes to encode (the utf-16 lengths include a 2-byte byte order mark):

In [27]:
s='RÖ猫𐒎'
for c in s:
    print(c,hex(ord(c)),len(c),len(c.encode('utf-8')), len(c.encode('utf-16')))
R 0x52 1 1 4
Ö 0xd6 1 2 4
猫 0x732b 1 3 4
𐒎 0x1048e 1 4 6
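
The same point applies to a whole string: len() counts code points, while the byte count depends on the encoding. This is a small sketch using the BOM-free little-endian codecs so that only the characters themselves are counted. The last character is written as the escape \U0001048e, and byte_counts is an illustrative name.

```python
s = 'RÖ猫\U0001048e'   # same four characters as above
print(len(s))          # number of code points

byte_counts = {}
for encoding in ('utf-8', 'utf-16-le', 'utf-32-le'):
    byte_counts[encoding] = len(s.encode(encoding))
    print(encoding, byte_counts[encoding])
```

In UTF-16 the last character takes 4 bytes rather than 2 because code points above 0xFFFF are stored as a surrogate pair.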