Unicode
Characters are not bytes, and bytes are not characters (not any more). The Python string type holds Unicode text; Unicode defines a numeric value for each of a very large set of characters.
ASCII is a subset of Unicode -- the first 128 code points (0 through 127).
In [2]:
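# Printable ASCII characters are 32..126; 127 (0x7f) is the DEL control character
for i in range(32,128):
    # eight entries per row: tab after each entry, newline after every eighth
    print(hex(i),chr(i),end='\t\t\t\t\t\t\t\n'[i%8])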
0x20     0x21 !   0x22 "   0x23 #   0x24 $   0x25 %   0x26 &   0x27 '
0x28 (   0x29 )   0x2a *   0x2b +   0x2c ,   0x2d -   0x2e .   0x2f /
0x30 0   0x31 1   0x32 2   0x33 3   0x34 4   0x35 5   0x36 6   0x37 7
0x38 8   0x39 9   0x3a :   0x3b ;   0x3c <   0x3d =   0x3e >   0x3f ?
0x40 @   0x41 A   0x42 B   0x43 C   0x44 D   0x45 E   0x46 F   0x47 G
0x48 H   0x49 I   0x4a J   0x4b K   0x4c L   0x4d M   0x4e N   0x4f O
0x50 P   0x51 Q   0x52 R   0x53 S   0x54 T   0x55 U   0x56 V   0x57 W
0x58 X   0x59 Y   0x5a Z   0x5b [   0x5c \   0x5d ]   0x5e ^   0x5f _
0x60 `   0x61 a   0x62 b   0x63 c   0x64 d   0x65 e   0x66 f   0x67 g
0x68 h   0x69 i   0x6a j   0x6b k   0x6c l   0x6d m   0x6e n   0x6f o
0x70 p   0x71 q   0x72 r   0x73 s   0x74 t   0x75 u   0x76 v   0x77 w
0x78 x   0x79 y   0x7a z   0x7b {   0x7c |   0x7d }   0x7e ~   0x7f
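As a small check of the claim that ASCII is the low end of Unicode, the sketch below confirms that every code point below 128 encodes to the same single byte under the 'ascii' and 'utf-8' codecs, and that a character outside that range is rejected by the ASCII codec. The sample character 'Ö' and the variable name ascii_ok are illustrative choices, not part of the original notebook.

```python
# Every code point 0..127 encodes to one identical byte in ASCII and UTF-8
for i in range(128):
    c = chr(i)
    assert c.encode('ascii') == c.encode('utf-8') == bytes([i])

# A character outside that range cannot be encoded as ASCII
try:
    'Ö'.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```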
The ord() function returns the Unicode value (the ordinal value, or Unicode "code point") for a character. The chr() function does the reverse: it returns the character for a Unicode value.
In [2]:
c='䆞'
print(c,hex(ord(c)))
# or the character for the ordinal
print(0x2551,chr(0x2551))
䆞 0x419e
9553 ║
Most characters have code points above 127. Here is a run of 64 consecutive code points, starting from a randomly chosen value:
In [20]:
from random import randint
start=randint(128,20000)
for i in range(start,start+64):
    print(hex(i),chr(i),end=' .. ')
0x21b7 ↷ .. 0x21b8 ↸ .. 0x21b9 ↹ .. 0x21ba ↺ .. 0x21bb ↻ .. 0x21bc ↼ .. 0x21bd ↽ .. 0x21be ↾ .. 0x21bf ↿ .. 0x21c0 ⇀ .. 0x21c1 ⇁ .. 0x21c2 ⇂ .. 0x21c3 ⇃ .. 0x21c4 ⇄ .. 0x21c5 ⇅ .. 0x21c6 ⇆ .. 0x21c7 ⇇ .. 0x21c8 ⇈ .. 0x21c9 ⇉ .. 0x21ca ⇊ .. 0x21cb ⇋ .. 0x21cc ⇌ .. 0x21cd ⇍ .. 0x21ce ⇎ .. 0x21cf ⇏ .. 0x21d0 ⇐ .. 0x21d1 ⇑ .. 0x21d2 ⇒ .. 0x21d3 ⇓ .. 0x21d4 ⇔ .. 0x21d5 ⇕ .. 0x21d6 ⇖ .. 0x21d7 ⇗ .. 0x21d8 ⇘ .. 0x21d9 ⇙ .. 0x21da ⇚ .. 0x21db ⇛ .. 0x21dc ⇜ .. 0x21dd ⇝ .. 0x21de ⇞ .. 0x21df ⇟ .. 0x21e0 ⇠ .. 0x21e1 ⇡ .. 0x21e2 ⇢ .. 0x21e3 ⇣ .. 0x21e4 ⇤ .. 0x21e5 ⇥ .. 0x21e6 ⇦ .. 0x21e7 ⇧ .. 0x21e8 ⇨ .. 0x21e9 ⇩ .. 0x21ea ⇪ .. 0x21eb ⇫ .. 0x21ec ⇬ .. 0x21ed ⇭ .. 0x21ee ⇮ .. 0x21ef ⇯ .. 0x21f0 ⇰ .. 0x21f1 ⇱ .. 0x21f2 ⇲ .. 0x21f3 ⇳ .. 0x21f4 ⇴ .. 0x21f5 ⇵ .. 0x21f6 ⇶ ..
str.encode() converts a string to bytes, and bytes.decode() converts bytes back to a string.
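A minimal sketch of that round trip follows. The sample string and the variable names are illustrative only; the last step shows that decoding with a different codec than the one used for encoding yields different text.

```python
s = 'RÖ猫'
b = s.encode('utf-8')          # str -> bytes
print(type(b), b)

t = b.decode('utf-8')          # bytes -> str, same codec
print(t == s)

# Decoding with a different codec does not recover the original text:
# latin-1 maps every byte to one character, so 6 bytes become 6 characters
wrong = b.decode('latin-1')
print(len(s), len(b), len(wrong))
```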
Different encodings require a different number of bytes:
In [5]:
c=chr(0x401)
print(c,len(c),len(c.encode('utf-8')), len(c.encode('utf-16')), len(c.encode('utf-32')))
#print(type(c),type(c.encode('utf-8')),type(c.encode('utf-16')),type(c.encode('utf-32')))
Ё 1 2 4 8
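The UTF-16 and UTF-32 lengths above are larger than the character itself needs because Python's generic 'utf-16' and 'utf-32' codecs prepend a byte order mark (BOM) of 2 and 4 bytes respectively. The sketch below compares them with the codecs that name an explicit byte order and therefore write no BOM. The variable names are illustrative.

```python
import codecs

c = chr(0x401)

# Generic codecs: BOM + character, in the machine's native byte order
with_bom_16 = c.encode('utf-16')
with_bom_32 = c.encode('utf-32')
print(len(with_bom_16), with_bom_16.startswith(codecs.BOM_UTF16))
print(len(with_bom_32), with_bom_32.startswith(codecs.BOM_UTF32))

# Explicit byte order: no BOM, so only the character's own bytes
for enc in ('utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be'):
    b = c.encode(enc)
    print(enc, len(b), b.hex())
```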
Here is a string of four characters, each of length one, that require different numbers of bytes to encode:
In [6]:
s='RÖ猫𐒎'
print(len(s))
for c in s:
    print(c,hex(ord(c)),len(c),len(c.encode('utf-8')), len(c.encode('utf-16')))
4
R 0x52 1 1 4
Ö 0xd6 1 2 4
猫 0x732b 1 3 4
𐒎 0x1048e 1 4 6
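The byte counts follow from the code point's magnitude. UTF-8 uses 1 to 4 bytes depending on the range the code point falls in, and UTF-16 uses 2 bytes, or 4 (a surrogate pair) for code points of 0x10000 and above. The helpers utf8_len and utf16_len below are hypothetical names written for this sketch. They are checked against the real codecs, using 'utf-16-le' so that no BOM is counted.

```python
def utf8_len(cp):
    """Bytes UTF-8 uses for code point cp (hypothetical helper)."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def utf16_len(cp):
    """Bytes UTF-16 uses, without a BOM: 2, or 4 for a surrogate pair."""
    return 2 if cp < 0x10000 else 4

for c in 'RÖ猫𐒎':
    cp = ord(c)
    # compare the predicted sizes with what the codecs actually produce
    assert utf8_len(cp) == len(c.encode('utf-8'))
    assert utf16_len(cp) == len(c.encode('utf-16-le'))
    print(c, hex(cp), utf8_len(cp), utf16_len(cp))
```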