Iterating over graphemes in PHP

Graphemes are the basic units of a writing system, corresponding to the characters a user perceives when editing text. Features such as cursor movement and text selection rely on accurately splitting text into graphemes (a process known as segmentation).

User-perceived characters may consist of multiple Unicode code points. For example, é may be represented as the code point for e followed by the code point for the combining acute accent, and 🏳️‍🌈 is represented by the code points for 🏳️, an invisible zero-width joiner (U+200D), and 🌈.
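
The gap between bytes, code points, and graphemes can be checked directly. The following is a minimal sketch, assuming the intl and mbstring extensions are loaded; the variable names are illustrative only:

```php
<?php
// "é" written as the letter e followed by U+0301 (combining acute accent).
$decomposed = "e\u{301}";

// Rainbow flag: U+1F3F3, U+FE0F (variation selector), U+200D (zero-width joiner), U+1F308.
$flag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";

foreach ([$decomposed, $flag] as $text) {
  printf(
    "%d bytes, %d code points, %d grapheme(s)\n",
    strlen($text),              // length in bytes
    mb_strlen($text, 'UTF-8'),  // length in code points
    grapheme_strlen($text)      // length in grapheme clusters
  );
}
```

The first string is 3 bytes and 2 code points, the second is 14 bytes and 4 code points. With a reasonably recent ICU library, each should be reported as a single grapheme.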

Unicode Standard Annex #29 (UAX #29) defines an algorithm for segmenting text into grapheme clusters, and PHP's intl extension provides functions based on this algorithm that can be used to iterate over graphemes.

grapheme_extract

The grapheme_extract function has been available since PHP 5.3 (released in 2009). It returns a number of graphemes from a UTF-8 string, based on a specified starting offset and a maximum number of graphemes, bytes, or Unicode code points. This generator function uses grapheme_extract to return an iterator over graphemes:

/**
 * Returns an iterator over graphemes in a string.
 *
 * @param string $string The string
 *
 * @return \Iterator<string>
 */
function graphemes(string $string): \Iterator {
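  // Byte offset at which the next grapheme starts.
  $next = 0;

  while (true) {
    // Extract one grapheme starting at byte offset $next. The fifth argument is
    // passed by reference and is updated to the offset of the following grapheme.
    $grapheme = grapheme_extract($string, 1, GRAPHEME_EXTR_COUNT, $next, $next);

    // grapheme_extract returns false once the offset reaches the end of the string.
    if ($grapheme === false) {
      break;
    }

    yield $grapheme;
  }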
}

The returned iterator can be used in a foreach loop:

foreach (graphemes('🏳️🏳️‍🌈🏳️‍⚧️') as $grapheme) {
  $size = strlen($grapheme);
  echo "$grapheme is $size bytes in size\n";
}

This produces the following output:

🏳️ is 7 bytes in size
🏳️‍🌈 is 14 bytes in size
🏳️‍⚧️ is 16 bytes in size
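
The graphemes function only uses the grapheme-count mode. The byte and code point limits mentioned earlier can be sketched as follows; this is an illustrative example, assuming the intl extension is loaded, and the sample string and variable names are not part of the original code:

```php
<?php
// "é" in decomposed form (3 bytes, 2 code points) followed by three ASCII letters.
$text = "e\u{301}abc";

// At most 1 grapheme.
$byCount = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT);

// As many whole graphemes as fit in 4 bytes: "é" (3 bytes) plus "a" (1 byte).
$byBytes = grapheme_extract($text, 4, GRAPHEME_EXTR_MAXBYTES);

// As many whole graphemes as fit in 3 code points: "é" (2) plus "a" (1).
$byChars = grapheme_extract($text, 3, GRAPHEME_EXTR_MAXCHARS);

// The fifth argument receives the byte offset at which the next call should start.
$next = 0;
$first = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, 0, $next);

echo bin2hex($byCount), "\n"; // hex of the bytes for e + U+0301
echo bin2hex($byBytes), "\n";
echo bin2hex($byChars), "\n";
echo $next, "\n";
```

In every mode the function returns whole graphemes only, so a limit that falls in the middle of a cluster leaves that cluster out.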

grapheme_str_split

The grapheme_str_split function has been available since PHP 8.4 (released in 2024). It splits a string into chunks containing a specified number of graphemes (defaulting to 1) and returns an array containing these chunks, which can be used directly in a foreach loop:

foreach (grapheme_str_split('🏳️🏳️‍🌈🏳️‍⚧️') as $grapheme) {
  $size = strlen($grapheme);
  echo "$grapheme is $size bytes in size\n";
}

When working with very large strings, grapheme_str_split uses significantly more memory than the graphemes function shown earlier, as it allocates memory for the entire array of graphemes whereas the iterator returned by graphemes extracts individual graphemes on demand.
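
Code that must also run on PHP versions before 8.4 could combine the two approaches. The sketch below is one possible way to do this, not an established API: split_graphemes is a hypothetical helper that uses grapheme_str_split when it exists and otherwise builds the same kind of array with grapheme_extract.

```php
<?php
/**
 * Splits a string into chunks of $length graphemes each.
 *
 * Hypothetical helper for illustration; assumes the intl extension is loaded.
 *
 * @return string[]
 */
function split_graphemes(string $string, int $length = 1): array {
  if ($length < 1) {
    throw new \InvalidArgumentException('$length must be at least 1');
  }

  // Prefer the native function where it is available (PHP 8.4 and later).
  if (function_exists('grapheme_str_split')) {
    $chunks = grapheme_str_split($string, $length);
    if ($chunks === false) {
      throw new \RuntimeException('The string could not be split');
    }
    return $chunks;
  }

  // Fallback: extract $length graphemes at a time until the end of the string.
  $chunks = [];
  $next = 0;
  $end = strlen($string);

  while ($next < $end) {
    $chunk = grapheme_extract($string, $length, GRAPHEME_EXTR_COUNT, $next, $next);
    if ($chunk === false || $chunk === '') {
      break; // stop rather than loop forever on input that cannot be segmented
    }
    $chunks[] = $chunk;
  }

  return $chunks;
}

foreach (split_graphemes("e\u{301}abc", 2) as $chunk) {
  echo strlen($chunk), " bytes\n";
}
```

Like grapheme_str_split, this helper holds every chunk in memory at once, so the generator remains the better fit for very large strings.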