Character Encoding (字符编码)

Abstract

A character encoding represents characters as bit strings.
There are many character encoding standards, such as ASCII, GBK, and UTF-8.

Incompatible Encoding Standard

Bytes written under one encoding standard may be displayed as different, often garbled, characters when they are decoded under another standard.
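A minimal sketch of this mismatch, using only the Go standard library. It takes the UTF-8 bytes of 家 and reads them back as if they were Latin-1 (ISO-8859-1). Latin-1 is chosen here only because it needs no extra package; the helper name decodeAsLatin1 is hypothetical.

```go
package main

import "fmt"

// decodeAsLatin1 interprets every byte as one Latin-1 character.
// Latin-1 code points coincide with Unicode code points 0-255,
// so each byte maps directly to a rune.
func decodeAsLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	utf8Bytes := []byte("家") // three bytes in UTF-8: e5 ae b6
	fmt.Printf("% x\n", utf8Bytes)
	// The same three bytes, read under the wrong standard,
	// show up as three unrelated characters.
	fmt.Println(decodeAsLatin1(utf8Bytes)) // å®¶
}
```

The bytes on disk never change; only the standard used to interpret them does.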

We should use the same character encoding standard everywhere; UTF-8 is recommended. If we see characters that are not displayed correctly, we shouldn't save the file, because saving overwrites the original bytes with placeholders for the wrongly displayed characters, and the original content can no longer be recovered.
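The data loss on save can be sketched as follows. This assumes an editor that opens a Latin-1 file as UTF-8 and replaces every undecodable byte with the replacement character U+FFFD when saving; saveAsUTF8 is a hypothetical stand-in for that behavior, built on strings.ToValidUTF8.

```go
package main

import (
	"fmt"
	"strings"
)

// saveAsUTF8 mimics an editor that writes U+FFFD in place of
// any bytes that are not valid UTF-8 (an assumed behavior).
func saveAsUTF8(raw string) string {
	return strings.ToValidUTF8(raw, "\uFFFD")
}

func main() {
	a := "caf\xe9" // "café" encoded in Latin-1
	b := "caf\xe8" // "cafè" encoded in Latin-1

	// After saving, both files hold the same placeholder,
	// so the original byte (e9 or e8) cannot be recovered.
	fmt.Printf("% x\n", saveAsUTF8(a))
	fmt.Println(saveAsUTF8(a) == saveAsUTF8(b)) // true
}
```

Reopening the untouched file with the correct encoding would have displayed it properly, which is why closing without saving is the safe choice.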

UTF-8

UTF-8 is a variable-length encoding, which saves a great deal of space when storing different characters. For example, it uses 1 byte to store a, 3 bytes to store 家, and 4 bytes to store 😊. With a fixed-length encoding that covers the same characters, we would need 4 bytes for every character. Variable-length encoding works by mapping each character's Unicode code point to a sequence of 1 to 4 bytes, where the leading bits of the first byte indicate how many bytes the sequence has.
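The byte counts above can be checked with a short sketch based on the standard unicode/utf8 package. The helper name encode is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode returns the UTF-8 byte sequence for a single code point.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest possible sequence
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	for _, r := range "a家😊" {
		b := encode(r)
		// Print the character, its code point, the encoded length,
		// the bytes in hex, and the first byte in binary.
		fmt.Printf("%c  U+%04X  %d byte(s)  % x  first byte %08b\n",
			r, r, len(b), b, b[0])
	}
}
```

In the binary column, the first byte starts with 0 for the 1-byte sequence, 1110 for the 3-byte sequence, and 11110 for the 4-byte sequence; that prefix is how a decoder knows the length of each character.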

UTF-8 in Go

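A string in Go is conventionally UTF-8 encoded (string literals always are) and behaves like a read-only slice of bytes. Indexing and len therefore operate on bytes, not characters. This explains why the index is off and why len(myString) returns $8$ instead of $6$ for a string of 6 characters.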
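A small sketch of the byte-versus-character gap. The original example string is not available, so "hello家" is an assumed stand-in: 6 characters that occupy 8 bytes. The helper byteAndRuneCount is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns the length of s in bytes and in characters (runes).
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	myString := "hello家" // assumed example: 5 ASCII letters plus one 3-byte character

	bytes, runes := byteAndRuneCount(myString)
	fmt.Println(bytes, runes) // 8 6

	// Indexing yields a single byte, here the first byte of 家,
	// not the whole character.
	fmt.Printf("%x\n", myString[5]) // e5

	// A range loop decodes runes, but the index is still a byte offset.
	for i, r := range myString {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println()
}
```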

Abstract away this weird behavior

We can convert the string to a slice of rune ([]rune) to get an intuitive, character-based interface to the string in Go. However, rune is an alias for int32, so every character then takes 4 bytes; this approach trades space for convenience.
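A minimal sketch of the conversion and its cost, again using the assumed example string "hello家". The helpers charAt and runeStorageBytes are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// charAt returns the i-th character of s by converting it to []rune first.
func charAt(s string, i int) string {
	return string([]rune(s)[i])
}

// runeStorageBytes returns how many bytes the []rune form of s occupies:
// rune is an alias for int32, so each element takes 4 bytes.
func runeStorageBytes(s string) int {
	return len([]rune(s)) * 4
}

func main() {
	myString := "hello家" // assumed example string

	runes := []rune(myString)
	fmt.Println(len(runes))          // 6, one element per character
	fmt.Println(charAt(myString, 5)) // 家

	// The UTF-8 string needs 8 bytes; the rune slice needs 24.
	fmt.Println(len(myString), runeStorageBytes(myString)) // 8 24
}
```

Each conversion also copies the data, so for a one-off pass over the characters a range loop over the string avoids the extra allocation.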

References

锟斤拷�⊠是怎样炼成的——中文显示「⼊」门指南【柴知道】 - YouTube
