Compressing the Wordle dictionary
The popular word-guessing game Wordle is implemented using a 173KB JavaScript bundle, reduced to a 58KB transfer using Brotli compression. By efficiently compressing the dictionary, we can reduce the JavaScript bundle size by 36% to 111KB and the compressed transfer size by 26% to 43KB.
The dictionary
The Wordle JavaScript bundle contains a 10,657-word dictionary listing all of the five-letter words that Wordle will accept, accounting for 48% of the bundle size (83KB). The dictionary is stored as a simple JavaScript array:
|
|
Text compression
The Brotli compression algorithm, like its predecessor gzip, can compress arbitrary text. The order of words matters when compressing most text, but in the case of an alphabetical list this information isn’t needed, as the words could have been stored in any order and later sorted into alphabetical order.
We can calculate how much information is required to represent the order of words. For a list of 10,657 unique words there are 10,657 × 10,656 × … × 3 × 2 × 1 possible permutations. Taking the binary logarithm of this number tells us that 127,219 bits are required to represent a permutation. This means we could potentially save over 15KB by only needing to represent one of the possible permutations.
Word differences
Rather than storing the words themselves, we can instead store the first word followed by the list of differences between consecutive words. This requires much less information to be stored if the words are arranged to reduce the differences between neighbouring words.
Arranging the words alphabetically is an effective way of reducing these differences, as neighbouring words usually share a common prefix — for example, cared, carer, cares, and caret differ only in the last letter.
These are the frequencies of each type of difference in the Wordle dictionary (where ⬜ represents a shared letter and ❌ represents a different letter):
Difference | Frequency | |
---|---|---|
⬜ ⬜ ⬜ ⬜ ❌ | 2946 | |
⬜ ⬜ ⬜ ❌ ⬜ | 1022 | |
⬜ ⬜ ⬜ ❌ ❌ | 4040 | |
⬜ ⬜ ❌ ⬜ ⬜ | 43 | |
⬜ ⬜ ❌ ⬜ ❌ | 106 | |
⬜ ⬜ ❌ ❌ ⬜ | 358 | |
⬜ ⬜ ❌ ❌ ❌ | 1832 | |
⬜ ❌ ⬜ ⬜ ❌ | 1 | |
⬜ ❌ ⬜ ❌ ⬜ | 1 | |
⬜ ❌ ⬜ ❌ ❌ | 6 | |
⬜ ❌ ❌ ⬜ ⬜ | 1 | |
⬜ ❌ ❌ ⬜ ❌ | 14 | |
⬜ ❌ ❌ ❌ ⬜ | 43 | |
⬜ ❌ ❌ ❌ ❌ | 218 | |
❌ ❌ ⬜ ❌ ⬜ | 2 | |
❌ ❌ ❌ ⬜ ❌ | 1 | |
❌ ❌ ❌ ❌ ⬜ | 4 | |
❌ ❌ ❌ ❌ ❌ | 18 |
From this table, we can see that 85% of the time a difference in one letter means all of the following letters differ as well. This isn’t surprising, as the first difference between two words determines their alphabetical order and the following letters will only match by chance.
Encoding suffixes
If we assume that a difference in one letter means all of the following letters differ, we only need to store the change to the suffix. We don’t need to store the position of the letters as this is determined by the length of the suffix — for example, a three-letter suffix will replace the final three letters of the preceding word. We gain more by not storing the positions than we lose due to the suffix occasionally including letters than didn’t differ.
These are the changed suffixes for the ten words following the first word (aahed) in the Wordle dictionary:
Word | Suffix |
---|---|
aalii | lii |
aargh | rgh |
aarti | ti |
abaca | baca |
abaci | i |
abacs | s |
abaft | ft |
abaka | ka |
abamp | mp |
aband | nd |
When storing the suffixes we need to be able to tell where one suffix ends and another begins. A simple way to do this is just to capitalise the first letter of each suffix. For the ten words above, this results in the string LiiRghTiBacaISFtKaMpNd. This suffix string is 56% shorter than the total length of the original words — 22 characters in comparison to 50 characters.
Decoding suffixes
To decode the suffix string back into the original word list, we initialise the word list to the first word, split the suffix string before capital letters, and then loop over the suffixes, adding each reconstructed word to the end of the word list:
|
|
The full version of this code can be found in suffix-decoder.js (21,460 bytes).
Size comparison
Encoding the dictionary as suffixes is so efficient that even without Brotli compression it’s smaller than the Brotli-compressed version of the original dictionary:
Dictionary | Original | Improved | Saving |
---|---|---|---|
Uncompressed | 83KB | 21KB | 75% |
Compressed | 28KB | 14KB | 50% |
We still see gains from Brotli-compressing the suffix-encoded dictionary because Brotli produces an efficient binary encoding.
Because the dictionary forms such a large fraction of the JavaScript bundle, the improvements are still significant once the suffix-encoded dictionary is substituted into the bundle:
Full bundle | Original | Improved | Saving |
---|---|---|---|
Uncompressed | 173KB | 111KB | 36% |
Compressed | 58KB | 43KB | 26% |
Significantly, the 15KB reduction in the compressed size matches the potential gains expected from ignoring the order of words.