Abstract


Incompatible Encoding Standard

Characters encoded in one standard may be displayed differently in another standard.

We should always use the same character encoding standard, UTF-8 is recommended. And if we see characters that are not displayed correctly, we shouldn’t save the file, because saving will overwrite the file with placeholders for those wrongly displayed characters.

Important

Programming languages like Python use the number of code point of the string as the length of the string, not the number of characters. One example is 👍🏻 which is represented with two code points, you will get two instead of one as the length of the string.

System programming languages more likely to use Unicode Unaware String Function which counts the number of Byte used to store a string.

UTF-8


  • UTF-8 allows variable-length encoding which means code point is variable-length, this brings a great amount of space saving to store different characters. As shown above, we use 1 Byte to store a, 4 bytes to store 😊, and 3 bytes to store . Without variable-length encoding, we need to use 4 bytes to store each character.

Unicode Unaware String Function


  • Unicode Unaware String Function counts the numbers of Byte used by the string, and not the number of characters inside the string
  • String in Go is encoded with UTF-8 and is treated as an Array of Byte. This explains why the index is off and len(myString) returns , instead of

  • This behaviour applies to other languages like C, you get when you run printf("%d", strlen("😊家"));, instead of

Abstract away this weird behavior

We can cast the string to an array of rune to have an intuitive interface to the string in Go as shown below. But rune as you can see below is int32, this approach comes with some space sacrifices.

References