Compressing the Wordle dictionary
The Brotli compression algorithm, like its predecessor gzip, can compress arbitrary text. The order of words matters when compressing most text, but in the case of an alphabetical list this information isn’t needed, as the words could have been stored in any order and later sorted into alphabetical order.
We can calculate how much information is required to represent the order of words. For a list of 10,657 unique words there are 10,657 × 10,656 × … × 3 × 2 × 1 possible permutations. Taking the binary logarithm of this number tells us that 127,219 bits are required to represent a permutation. This means we could potentially save over 15KB by only needing to represent one of the possible permutations.
Rather than storing the words themselves, we can instead store the first word followed by the list of differences between consecutive words. This requires much less information to be stored if the words are arranged to reduce the differences between neighbouring words.
Arranging the words alphabetically is an effective way of reducing these differences, as neighbouring words usually share a common prefix — for example, cared, carer, cares, and caret differ only in the last letter.
These are the frequencies of each type of difference in the Wordle dictionary (where ⬜ represents a shared letter and ❌ represents a different letter):
|⬜ ⬜ ⬜ ⬜ ❌||2946|
|⬜ ⬜ ⬜ ❌ ⬜||1022|
|⬜ ⬜ ⬜ ❌ ❌||4040|
|⬜ ⬜ ❌ ⬜ ⬜||43|
|⬜ ⬜ ❌ ⬜ ❌||106|
|⬜ ⬜ ❌ ❌ ⬜||358|
|⬜ ⬜ ❌ ❌ ❌||1832|
|⬜ ❌ ⬜ ⬜ ❌||1|
|⬜ ❌ ⬜ ❌ ⬜||1|
|⬜ ❌ ⬜ ❌ ❌||6|
|⬜ ❌ ❌ ⬜ ⬜||1|
|⬜ ❌ ❌ ⬜ ❌||14|
|⬜ ❌ ❌ ❌ ⬜||43|
|⬜ ❌ ❌ ❌ ❌||218|
|❌ ❌ ⬜ ❌ ⬜||2|
|❌ ❌ ❌ ⬜ ❌||1|
|❌ ❌ ❌ ❌ ⬜||4|
|❌ ❌ ❌ ❌ ❌||18|
From this table, we can see that 85% of the time a difference in one letter means all of the following letters differ as well. This isn’t surprising, as the first difference between two words determines their alphabetical order and the following letters will only match by chance.
If we assume that a difference in one letter means all of the following letters differ, we only need to store the change to the suffix. We don’t need to store the position of the letters as this is determined by the length of the suffix — for example, a three-letter suffix will replace the final three letters of the preceding word. We gain more by not storing the positions than we lose due to the suffix occasionally including letters than didn’t differ.
These are the changed suffixes for the ten words following the first word (aahed) in the Wordle dictionary:
When storing the suffixes we need to be able to tell where one suffix ends and another begins. A simple way to do this is just to capitalise the first letter of each suffix. For the ten words above, this results in the string LiiRghTiBacaISFtKaMpNd. This suffix string is 56% shorter than the total length of the original words — 22 characters in comparison to 50 characters.
To decode the suffix string back into the original word list, we initialise the word list to the first word, split the suffix string before capital letters, and then loop over the suffixes, adding each reconstructed word to the end of the word list:
The full version of this code can be found in suffix-decoder.js (21,460 bytes).
Encoding the dictionary as suffixes is so efficient that even without Brotli compression it’s smaller than the Brotli-compressed version of the original dictionary:
We still see gains from Brotli-compressing the suffix-encoded dictionary because Brotli produces an efficient binary encoding.
Significantly, the 15KB reduction in the compressed size matches the potential gains expected from ignoring the order of words.