Go by Example: Strings and Runes

Go 1.23

This example covers how Go handles text with strings and runes. It explains the difference between bytes and Unicode code points (runes), and demonstrates how to iterate correctly over UTF-8 encoded strings and access individual characters.

Code

package main

import "fmt"

func main() {
    // String basics
    s := "Hello, 世界"
    fmt.Println("String:", s)
    fmt.Println("Length in bytes:", len(s))
    
    // Iterate by rune (Unicode code point); i is the byte index where each rune starts
    for i, r := range s {
        fmt.Printf("Index %d: %c (rune: %v)\n", i, r, r)
    }
    
    // String indexing (bytes)
    fmt.Println("First byte:", s[0])
}

Explanation

In Go, a string is fundamentally a read-only slice of bytes. By convention those bytes hold UTF-8 encoded text, although the language does not enforce this and a string may contain arbitrary bytes. This design differs from languages where strings are arrays of characters. Go uses the term rune for a Unicode code point; the rune type is an alias for int32. This distinction between bytes and runes is crucial for correctly handling international text.
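The read-only nature of strings can be seen in a small sketch. The helper name upperFirstByte is hypothetical and not part of the original example; it assumes the only way to "modify" a string is to copy its bytes, change the copy, and build a new string.

```go
package main

import "fmt"

// upperFirstByte is a hypothetical helper for illustration only.
// It returns a copy of s whose first byte is upper-cased when that
// byte is an ASCII lowercase letter. The input string is never changed.
func upperFirstByte(s string) string {
	if s == "" {
		return s
	}
	b := []byte(s) // copies the bytes; the string itself stays read-only
	if b[0] >= 'a' && b[0] <= 'z' {
		b[0] -= 'a' - 'A'
	}
	return string(b)
}

func main() {
	s := "hello, 世界"
	// s[0] = 'H' would not compile: string bytes cannot be assigned to.
	fmt.Println(upperFirstByte(s)) // Hello, 世界
	fmt.Println(s)                 // hello, 世界 (the original is unchanged)
}
```

Working on the first byte is only safe here because the check restricts it to ASCII; a multi-byte first character is left untouched.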

Because UTF-8 is a variable-width encoding, a single code point occupies 1 to 4 bytes. ASCII characters like 'H' use 1 byte, while the Chinese characters '世' and '界' use 3 bytes each. The built-in len() function returns the byte count, not the character (rune) count. For the string "Hello, 世界", len() returns 13 (7 bytes for "Hello, " plus 6 for the two Chinese characters), even though the string contains only 9 characters.
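To make the byte count versus rune count concrete, here is a minimal sketch using utf8.RuneCountInString from the standard unicode/utf8 package. The helper byteAndRuneCount is a hypothetical name introduced for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount is a hypothetical helper for illustration only.
// It returns the length of s in bytes and the number of runes it holds.
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneCount("Hello, 世界")
	fmt.Println("bytes:", b) // bytes: 13
	fmt.Println("runes:", r) // runes: 9
}
```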

Iterating over a string with the range keyword automatically decodes UTF-8, yielding one rune at a time along with its starting byte index. This built-in UTF-8 awareness makes range the safest way to process Unicode strings. Direct indexing with s[0] accesses individual bytes, not runes, so indexing into a multi-byte character returns only one of its bytes rather than the whole character. Go source code is itself defined to be UTF-8 encoded, which allows Unicode literals directly in code.
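When a character is needed by position rather than by byte offset, one common approach is to convert the string to a []rune first. The sketch below assumes that the extra allocation is acceptable; runeAt is a hypothetical helper, not something the original example defines.

```go
package main

import "fmt"

// runeAt is a hypothetical helper for illustration only.
// It returns the n-th rune of s (0-based) and reports whether n was in range.
// Converting to []rune decodes and copies the whole string.
func runeAt(s string, n int) (rune, bool) {
	rs := []rune(s)
	if n < 0 || n >= len(rs) {
		return 0, false
	}
	return rs[n], true
}

func main() {
	s := "Hello, 世界"
	if r, ok := runeAt(s, 7); ok {
		fmt.Printf("%c\n", r) // 世
	}
	if _, ok := runeAt(s, 9); !ok {
		fmt.Println("index 9 is out of range: the string has 9 runes")
	}
}
```

For a single pass over the text, a range loop avoids the copy; the conversion is mainly useful when random access by character position is required.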

Rune Literals: Rune literals are enclosed in single quotes (e.g., 'A', '⌘') and denote integer values, namely the code point of the character; their default type is rune. String literals use double quotes ("hello") or backticks for raw strings.
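The integer nature of rune literals can be checked by printing one with different format verbs. This is a small sketch; describeRune is a hypothetical helper name.

```go
package main

import "fmt"

// describeRune is a hypothetical helper for illustration only.
// It formats a rune as its character, its U+ notation, and its decimal value.
func describeRune(r rune) string {
	return fmt.Sprintf("%c %U %d", r, r, r)
}

func main() {
	fmt.Println(describeRune('A')) // A U+0041 65
	fmt.Println(describeRune('⌘')) // ⌘ U+2318 8984

	// Interpreted (double-quoted) and raw (backtick) string literals differ
	// in how they treat escapes such as \n.
	fmt.Println(len("a\n")) // 2: 'a' plus a newline byte
	fmt.Println(len(`a\n`)) // 3: 'a', a backslash, and 'n'
}
```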

Code Breakdown

7

Declares a string variable with UTF-8 content. Go source files are UTF-8 encoded, so you can include Unicode characters directly in string literals.

9

len(s) returns the byte length, not the character count. "Hello, 世界" is 7 bytes for "Hello, " + 6 bytes for "世界" (3 bytes each) = 13 bytes total. This is different from the 9 runes (characters) in the string.

12-14

The range loop automatically handles UTF-8 decoding. Variable 'i' is the byte index where the rune starts, and 'r' is the rune value (Unicode code point). After each rune, the loop advances by that rune's encoded width, so the index jumps over the remaining bytes of a multi-byte character. '世' starts at byte index 7, and '界' starts at byte index 10.
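The behavior of the range loop can be approximated by decoding manually with utf8.DecodeRuneInString from the standard unicode/utf8 package. This is a sketch of the idea, not how the compiler implements range; runeStarts is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts is a hypothetical helper for illustration only.
// It returns the byte index at which each rune of s begins,
// decoding one rune at a time and advancing by its encoded width.
func runeStarts(s string) []int {
	var starts []int
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		starts = append(starts, i)
		i += size // 1 for ASCII, 3 for '世' and '界'
	}
	return starts
}

func main() {
	fmt.Println(runeStarts("Hello, 世界")) // [0 1 2 3 4 5 6 7 10]
}
```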

17

String indexing with s[0] accesses individual bytes, not runes. This returns the first byte (72, which is 'H' in ASCII). Be careful: indexing multi-byte characters this way will only give you part of the character.
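The following sketch shows what happens when a byte index or slice lands inside a multi-byte character. It uses utf8.ValidString from the standard library; validSlice is a hypothetical helper that assumes from and to are within the bounds of the string.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validSlice is a hypothetical helper for illustration only.
// It reports whether s[from:to] is valid UTF-8, which is false when the
// slice boundaries cut through a multi-byte character.
// It assumes 0 <= from <= to <= len(s).
func validSlice(s string, from, to int) bool {
	return utf8.ValidString(s[from:to])
}

func main() {
	s := "Hello, 世界"
	fmt.Println(s[0]) // 72: the single byte of 'H'
	fmt.Println(s[7]) // 228: only the first of the three bytes of '世'

	fmt.Println(validSlice(s, 7, 10)) // true: all three bytes of '世'
	fmt.Println(validSlice(s, 7, 8))  // false: a partial character
}
```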
