Characters are not bytes, and bytes are not characters (not any more). The Python string type supports Unicode, which defines numeric values (code points) for a very large set of characters.
ASCII is a subset of Unicode: the first 128 code points (0 to 127).
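# Printable ASCII characters are between 32 and 126 (127 is the DEL control character)
for i in range(32,128):
    # print 8 entries per row: tab after each entry, newline after every 8th
    print(hex(i),chr(i),end='\t\t\t\t\t\t\t\n'[i%8])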
0x20 0x21 ! 0x22 " 0x23 # 0x24 $ 0x25 % 0x26 & 0x27 '
0x28 ( 0x29 ) 0x2a * 0x2b + 0x2c , 0x2d - 0x2e . 0x2f /
0x30 0 0x31 1 0x32 2 0x33 3 0x34 4 0x35 5 0x36 6 0x37 7
0x38 8 0x39 9 0x3a : 0x3b ; 0x3c < 0x3d = 0x3e > 0x3f ?
0x40 @ 0x41 A 0x42 B 0x43 C 0x44 D 0x45 E 0x46 F 0x47 G
0x48 H 0x49 I 0x4a J 0x4b K 0x4c L 0x4d M 0x4e N 0x4f O
0x50 P 0x51 Q 0x52 R 0x53 S 0x54 T 0x55 U 0x56 V 0x57 W
0x58 X 0x59 Y 0x5a Z 0x5b [ 0x5c \ 0x5d ] 0x5e ^ 0x5f _
0x60 ` 0x61 a 0x62 b 0x63 c 0x64 d 0x65 e 0x66 f 0x67 g
0x68 h 0x69 i 0x6a j 0x6b k 0x6c l 0x6d m 0x6e n 0x6f o
0x70 p 0x71 q 0x72 r 0x73 s 0x74 t 0x75 u 0x76 v 0x77 w
0x78 x 0x79 y 0x7a z 0x7b { 0x7c | 0x7d } 0x7e ~ 0x7f
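As a quick check of the subset relationship, the sketch below confirms that every code point from 0 to 127 encodes to the same single byte in ASCII and in UTF-8, while a character outside that range cannot be encoded as ASCII at all. The variable names are illustrative only.

```python
# Every ASCII code point (0..127) encodes identically in ASCII and UTF-8.
ascii_chars = [chr(i) for i in range(128)]
same_bytes = all(ch.encode('ascii') == ch.encode('utf-8') for ch in ascii_chars)
print(same_bytes)

# A character outside that range has no ASCII encoding.
try:
    'Ö'.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```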
The ord() function returns the Unicode value (the ordinal value or Unicode "code point") for a character. The chr() function does the reverse: it returns the character for a Unicode value.
c='\u0a31'  # a single character, written here as an escape sequence
print(c,hex(ord(c)))
# or the character for the ordinal
print(0x2551,chr(0x2551))
0xa31 9553 ║
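The inverse relationship between ord() and chr() can be checked directly. This is a minimal sketch: the sample string is arbitrary, and sys.maxunicode is used only to show the upper bound that chr() accepts.

```python
import sys

# chr() and ord() are inverses of each other for any single character.
sample = 'A║火'
round_trip = [chr(ord(ch)) == ch for ch in sample]
print(round_trip)

# The largest code point chr() accepts is sys.maxunicode.
print(hex(sys.maxunicode))
try:
    chr(sys.maxunicode + 1)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
print(out_of_range_ok)
```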
Most characters have code points above 127. Here are some random examples:
from random import randint
start=randint(128,20000)
for i in range(start,start+64):
    print(hex(i),chr(i),end=' .. ')
0x4198 䆘 .. 0x4199 䆙 .. 0x419a 䆚 .. 0x419b 䆛 .. 0x419c 䆜 .. 0x419d 䆝 .. 0x419e 䆞 .. 0x419f 䆟 .. 0x41a0 䆠 .. 0x41a1 䆡 .. 0x41a2 䆢 .. 0x41a3 䆣 .. 0x41a4 䆤 .. 0x41a5 䆥 .. 0x41a6 䆦 .. 0x41a7 䆧 .. 0x41a8 䆨 .. 0x41a9 䆩 .. 0x41aa 䆪 .. 0x41ab 䆫 .. 0x41ac 䆬 .. 0x41ad 䆭 .. 0x41ae 䆮 .. 0x41af 䆯 .. 0x41b0 䆰 .. 0x41b1 䆱 .. 0x41b2 䆲 .. 0x41b3 䆳 .. 0x41b4 䆴 .. 0x41b5 䆵 .. 0x41b6 䆶 .. 0x41b7 䆷 .. 0x41b8 䆸 .. 0x41b9 䆹 .. 0x41ba 䆺 .. 0x41bb 䆻 .. 0x41bc 䆼 .. 0x41bd 䆽 .. 0x41be 䆾 .. 0x41bf 䆿 .. 0x41c0 䇀 .. 0x41c1 䇁 .. 0x41c2 䇂 .. 0x41c3 䇃 .. 0x41c4 䇄 .. 0x41c5 䇅 .. 0x41c6 䇆 .. 0x41c7 䇇 .. 0x41c8 䇈 .. 0x41c9 䇉 .. 0x41ca 䇊 .. 0x41cb 䇋 .. 0x41cc 䇌 .. 0x41cd 䇍 .. 0x41ce 䇎 .. 0x41cf 䇏 .. 0x41d0 䇐 .. 0x41d1 䇑 .. 0x41d2 䇒 .. 0x41d3 䇓 .. 0x41d4 䇔 .. 0x41d5 䇕 .. 0x41d6 䇖 .. 0x41d7 䇗 ..
str.encode() and bytes.decode() convert strings to and from bytes.
Different encodings require a different number of bytes:
c=chr(0x706b)
print(c,len(c),len(c.encode('utf-8')), len(c.encode('utf-16')), len(c.encode('utf-32')))
print(type(c),type(c.encode('utf-8')),type(c.encode('utf-16')),type(c.encode('utf-32')))
火 1 3 4 8
<class 'str'> <class 'bytes'> <class 'bytes'> <class 'bytes'>
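Encoding and decoding are inverse operations as long as the same encoding is used in both directions. The sketch below shows one round trip; the variable names are illustrative.

```python
text = '火'
data = text.encode('utf-8')      # str -> bytes
print(data)                      # the raw UTF-8 bytes
restored = data.decode('utf-8')  # bytes -> str
print(restored == text)

# Decoding these bytes as ASCII fails because they are outside the 7-bit range.
try:
    data.decode('ascii')
    wrong_ok = True
except UnicodeDecodeError:
    wrong_ok = False
print(wrong_ok)
```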
Here's an example of several characters that are each a single character but require a different number of bytes to encode:
s='RÖ猫𐒎'
for c in s:
    print(c,hex(ord(c)),len(c),len(c.encode('utf-8')), len(c.encode('utf-16')))
R 0x52 1 1 4
Ö 0xd6 1 2 4
猫 0x732b 1 3 4
𐒎 0x1048e 1 4 6
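The UTF-16 lengths above include a 2-byte byte order mark (BOM) that Python's 'utf-16' codec prepends, which is why even 'R' takes 4 bytes. As a sketch, encoding with an explicit byte order such as 'utf-16-le' omits the BOM and shows the size of the character data alone.

```python
import codecs

s = 'RÖ猫𐒎'
for c in s:
    with_bom = c.encode('utf-16')        # BOM + character data
    without_bom = c.encode('utf-16-le')  # character data only
    print(c, len(with_bom), len(without_bom))

# The first two bytes of the 'utf-16' output are the BOM.
bom = 'R'.encode('utf-16')[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

Characters outside the Basic Multilingual Plane, such as 𐒎, need two 16-bit units (a surrogate pair) in UTF-16, which accounts for the extra two bytes in the last row.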