Recently axe-core decided to adopt a size budget as part of an ongoing effort to improve our performance. We noted that axe-cores file size had more than doubled since the last major release and that we needed to start monitoring our size better.
With the budget in place, I started investigating ways to reduce the file size. We use webpack to bundle axe-core so I used webpack-bundle-analyzer to see a detailed view of how large each file was.
As you may have noticed, there are a few large files that make up our core codebase. The largest of those is a file called valid-langs.js
. This particular file is 48 kB minified and 20 kB gzipped. Looking into the file it was a list of all ISO 639-1 and 2 codes that we use to validate the value of the HTML lang
attribute.
Thinking about how to reduce the file size, someone from the Google Lighthouse team suggested using a trie data structure. It was a good idea, but I was having trouble figuring out how to store the structure that would result in a smaller file size.
I then remember that there was a music player used for the js13kGames game jam that could store entire songs in less than 1 kB gzipped. They used a technique to store all the information of a song as "a series of nested arrays containing instrument, pattern and sequence data." It turns out JavaScript is really good at compressing nested arrays of numbers.
I figured if they could store songs in such a small size I could use the same technique to store a trie of stings. I discovered that I could store the data as a sequence of nested arrays, where each index represented the letter (a = 0, b = 1, c = 2, etc.). For example, the string aaa
could be stored as [[[1]]]
. I could then check if the string aaa
was valid by converting each letter to its index and navigating the nested arrays. If at any point the index returned null
, the string was invalid.
However, this resulted in one and two character strings that started with the same sequence to also be valid. For example, the string a
should be invalid but would return valid since there are valid 3 character strings that start with a
.
To account for this, I decided to make the letters 1 indexed so that I could use the 0th index to denote if the one or two character string was valid (by padding the end of the string with the `
character). For example, the string aaa
and aa
are valid, and can be stored as [,[,[1,1]]
. Since the string a
is turned into a``
, it returns invalid as the 2nd nested arrays 0th index is empty.
Doing this to all ISO 639-1 and 2 codes resulted in a total file size of 2 kB gzipped, about 17 kB in savings!
Below is the code for converting an array of strings into a trie of nested arrays and a function to determine if a given string is valid:
const strings = [...]; // array of strings
const enocdedStrings = [];
strings.forEach(str => {
str = str.padEnd(3, '`');
let array = enocdedStrings;
str.split('').forEach((char, i) => {
// index is 1 indexed
const index = char.charCodeAt(0) - 96;
if (i < str.length - 1) {
array[index] = array[index] || [];
array = array[index];
}
else {
array[index] = 1;
}
});
});
/**
* Determine if a string is a valid entry in the trie.
* @param {String} str - String to validate.
* @return {Boolean}
*/
function isValid(str) {
let array = enocdedStrings;
str = str.padEnd(3, '`');
for (let i = 0; i <= str.length - 1; i++) {
const index = str.charCodeAt(i) - 96;
array = array[index];
if (!array) {
return false;
}
}
return true;
}