Abstract
- Character encoding represents characters as bit strings
- There are many different character encoding standards, such as ASCII, GBK, and UTF-8
Incompatible Encoding Standard
Characters encoded in one standard may be displayed differently in another standard.
We should use the same character encoding standard everywhere; UTF-8 is recommended. If we see characters that are not displayed correctly, we shouldn’t save the file, because saving overwrites the original bytes with placeholders for those wrongly displayed characters.
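A minimal sketch in Go of how the same bytes read differently under another standard. The helper decodeAsLatin1 is a hypothetical name introduced for illustration. It relies on the fact that every Latin-1 (ISO-8859-1) byte value equals the Unicode code point with the same number.

```go
package main

import "fmt"

// decodeAsLatin1 is a hypothetical helper: it reads raw bytes as if they
// were Latin-1 (ISO-8859-1), where each byte value maps directly to the
// Unicode code point with the same number.
func decodeAsLatin1(raw []byte) string {
	runes := make([]rune, 0, len(raw))
	for _, b := range raw {
		runes = append(runes, rune(b))
	}
	return string(runes)
}

func main() {
	original := "\u00e9" // é, stored in UTF-8 as the two bytes 0xC3 0xA9
	raw := []byte(original)

	fmt.Printf("bytes: % x\n", raw)
	fmt.Println("read as UTF-8:  ", string(raw))
	fmt.Println("read as Latin-1:", decodeAsLatin1(raw))
}
```

The bytes never change; only the interpretation does. Read as Latin-1, the two bytes of é become the two characters Ã and ©.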
UTF-8
- UTF-8 is a variable-length encoding, which saves a great deal of space when storing different characters. For example, we use 1 byte to store a, 3 bytes to store 家, and 4 bytes to store 😊. With a fixed-width encoding such as UTF-32, we would need 4 bytes for every character. The number of bytes depends on the character's code point: the larger the code point, the more bytes are needed
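The byte counts above can be checked with a short sketch. It assumes the standard unicode/utf8 package; utf8Len is a hypothetical wrapper name added here for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf8Len reports how many bytes UTF-8 needs for a single code point.
// It is a thin, illustrative wrapper around the standard utf8.RuneLen.
func utf8Len(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// Print each character, its code point, and its UTF-8 byte count.
	for _, r := range []rune{'a', '家', '😊'} {
		fmt.Printf("%c  U+%04X  %d byte(s)\n", r, r, utf8Len(r))
	}
}
```

Code points up to U+007F take 1 byte, up to U+07FF take 2, up to U+FFFF take 3, and the rest take 4.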
UTF-8 in Go
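- A string in Go is a read-only sequence of bytes, and string literals are encoded with UTF-8. This explains why indexing is off for non-ASCII text: an index selects a byte rather than a character, and len(myString) returns the number of bytes instead of the number of characters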
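A small sketch of the byte-versus-character mismatch. The sample value of myString and the helper byteAndRuneCount are assumptions made for illustration; utf8.RuneCountInString is from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns the byte length and the code point count of s.
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	myString := "a家😊" // hypothetical sample value
	nBytes, nRunes := byteAndRuneCount(myString)
	fmt.Println("len(myString):", nBytes) // counts bytes: 1 + 3 + 4
	fmt.Println("code points:  ", nRunes)

	// Indexing selects a single byte, not a character.
	fmt.Printf("myString[1] = 0x%X\n", myString[1])

	// range decodes UTF-8 and yields the byte offset where each code point starts.
	for i, r := range myString {
		fmt.Printf("offset %d: %c\n", i, r)
	}
}
```

Iterating with range is often enough when the string only needs to be read front to back, since it decodes code points without copying the string.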
Abstract away this weird behavior
We can convert the string to a slice of rune to get an intuitive, character-by-character interface to the string in Go. However, rune is an alias for int32, so every character takes 4 bytes and this approach sacrifices some space.
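A minimal sketch of the rune conversion and its space cost. The helper runeAt and the sample string are hypothetical, and the conversion copies the whole string. Note that a rune is a code point, so a character built from several code points (such as a letter plus a combining accent) still spans more than one rune.

```go
package main

import "fmt"

// runeAt returns the i-th code point of s by first converting s to []rune.
// It is a hypothetical helper; the conversion allocates a new slice.
func runeAt(s string, i int) rune {
	return []rune(s)[i]
}

func main() {
	myString := "a家😊" // hypothetical sample value
	runes := []rune(myString)

	// Each rune is an int32, so the slice needs 4 bytes per code point.
	fmt.Println(len(myString), "bytes as a string")
	fmt.Println(len(runes), "runes,", len(runes)*4, "bytes as a []rune")

	// Indexing the rune slice now matches the intuitive character position.
	fmt.Printf("%c\n", runeAt(myString, 1))
}
```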