Iterating over graphemes in PHP
Graphemes are the basic units of a writing system, corresponding to the characters a user perceives when editing text. Features such as cursor movement and text selection rely on accurately splitting text into graphemes (a process known as segmentation).
User-perceived characters may consist of multiple Unicode code points. For example, é may be represented as the code point for e followed by the code point for the combining acute accent, and 🏳️‍🌈 is represented by the code points for 🏳️, an invisible zero-width joiner, and 🌈.
Unicode Standard Annex #29 defines an algorithm for segmenting text into grapheme clusters, and PHP's intl extension provides functions based on this algorithm that can be used to iterate over graphemes.
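To make the difference between bytes, code points, and graphemes concrete, the following sketch measures the decomposed form of é in all three units. It assumes the intl and mbstring extensions are loaded, and the variable names are illustrative only.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived character.
$eAcute = "e\u{0301}";

$bytes      = strlen($eAcute);             // UTF-8 bytes
$codePoints = mb_strlen($eAcute, 'UTF-8'); // Unicode code points (mbstring)
$graphemes  = grapheme_strlen($eAcute);    // grapheme clusters (intl)

echo "$bytes bytes, $codePoints code points, $graphemes grapheme\n";
// prints "3 bytes, 2 code points, 1 grapheme"
```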
grapheme_extract
The grapheme_extract function has been available since PHP 5.3 (released in 2009). It returns a number of graphemes from a UTF-8 string, based on a specified starting offset and a maximum number of graphemes, bytes, or Unicode code points. This generator function uses grapheme_extract to return an iterator over graphemes:
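A minimal sketch of such a generator, assuming the intl extension is loaded, might look like the following. The function name graphemes matches the name used later in this article; the parameter name, local variable names, and the defensive guard are assumptions made for illustration.

```php
<?php
// Sketch of a grapheme iterator built on grapheme_extract (intl extension).
function graphemes(string $text): Generator
{
    $length = strlen($text); // grapheme_extract works with byte offsets
    $offset = 0;

    while ($offset < $length) {
        // Extract one grapheme starting at $offset. $next receives the
        // byte offset at which the following grapheme begins.
        $grapheme = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, $offset, $next);

        // Defensive guard (an assumption, not required by the API):
        // stop on failure or if the offset did not advance.
        if ($grapheme === false || $grapheme === '' || $next <= $offset) {
            break;
        }

        yield $grapheme;
        $offset = $next;
    }
}
```

Because the function yields one grapheme at a time, only the current grapheme and the byte offset are held between iterations.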
The returned iterator can be used in a foreach loop:
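A usage sketch, assuming an input string made up of three flag emoji (written with \u{...} escapes so that the invisible joiners are explicit). The variable name $flags is illustrative, and a compact version of the graphemes generator is included so the snippet runs on its own with the intl extension loaded.

```php
<?php
// Compact version of the graphemes generator, included for self-containment.
function graphemes(string $text): Generator
{
    for ($offset = 0, $length = strlen($text); $offset < $length; $offset = $next) {
        $grapheme = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, $offset, $next);
        if ($grapheme === false || $next <= $offset) {
            return; // defensive guard against failure or a stalled offset
        }
        yield $grapheme;
    }
}

// White flag, rainbow flag, transgender flag.
$flags = "\u{1F3F3}\u{FE0F}"
       . "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}"
       . "\u{1F3F3}\u{FE0F}\u{200D}\u{26A7}\u{FE0F}";

foreach (graphemes($flags) as $grapheme) {
    // strlen counts bytes, not code points or graphemes.
    echo $grapheme, ' is ', strlen($grapheme), " bytes in size\n";
}
```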
This produces the following output:
🏳️ is 7 bytes in size
🏳️‍🌈 is 14 bytes in size
🏳️‍⚧️ is 16 bytes in size
grapheme_str_split
The grapheme_str_split function has been available since PHP 8.4 (released in 2024). It splits a string into chunks containing a specified number of graphemes (defaulting to 1) and returns an array containing these chunks, which can be used directly in a foreach loop:
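A minimal sketch, assuming PHP 8.4 or later with the intl extension loaded. The $flags string is an illustrative input made up of three flag emoji, written with \u{...} escapes so that the invisible joiners are explicit.

```php
<?php
// White flag, rainbow flag, transgender flag.
$flags = "\u{1F3F3}\u{FE0F}"
       . "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}"
       . "\u{1F3F3}\u{FE0F}\u{200D}\u{26A7}\u{FE0F}";

// With the default chunk length of 1, each array element is one grapheme.
foreach (grapheme_str_split($flags) as $grapheme) {
    echo $grapheme, ' is ', strlen($grapheme), " bytes in size\n";
}
```

Passing a second argument, for example grapheme_str_split($flags, 2), would instead return chunks of up to two graphemes each.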
When working with very large strings, grapheme_str_split uses significantly more memory than the graphemes function shown earlier, as it allocates memory for the entire array of graphemes, whereas the iterator returned by graphemes extracts individual graphemes on demand.