Reducing Axe-core File Size: Trie Stored as Nested Arrays

Recently axe-core decided to adopt a size budget as part of an ongoing effort to improve our performance. We noted that axe-cores file size had more than doubled since the last major release and that we needed to start monitoring our size better.

With the budget in place, I started investigating ways to reduce the file size. We use webpack to bundle axe-core so I used webpack-bundle-analyzer to see a detailed view of how large each file was.

Webpack-bundle-analyzer output of axe-core. Click for larger image

As you may have noticed, there are a few large files that make up our core codebase. The largest of those is a file called valid-langs.js. This particular file is 48 kB minified and 20 kB gzipped. Looking into the file it was a list of all ISO 639-1 and 2 codes that we use to validate the value of the HTML lang attribute.

Thinking about how to reduce the file size, someone from the Google Lighthouse team suggested using a trie data structure. It was a good idea, but I was having trouble figuring out how to store the structure that would result in a smaller file size.

I then remember that there was a music player used for the js13kGames game jam that could store entire songs in less than 1 kB gzipped. They used a technique to store all the information of a song as "a series of nested arrays containing instrument, pattern and sequence data." It turns out JavaScript is really good at compressing nested arrays of numbers.

I figured if they could store songs in such a small size I could use the same technique to store a trie of stings. I discovered that I could store the data as a sequence of nested arrays, where each index represented the letter (a = 0, b = 1, c = 2, etc.). For example, the string aaa could be stored as [[[1]]]. I could then check if the string aaa was valid by converting each letter to its index and navigating the nested arrays. If at any point the index returned null, the string was invalid.

However, this resulted in one and two character strings that started with the same sequence to also be valid. For example, the string a should be invalid but would return valid since there are valid 3 character strings that start with a.

To account for this, I decided to make the letters 1 indexed so that I could use the 0th index to denote if the one or two character string was valid (by padding the end of the string with the ` character). For example, the string aaa and aa are valid, and can be stored as [,[,[1,1]]. Since the string a is turned into a``, it returns invalid as the 2nd nested arrays 0th index is empty.

Doing this to all ISO 639-1 and 2 codes resulted in a total file size of 2 kB gzipped, about 17 kB in savings!

Below is the code for converting an array of strings into a trie of nested arrays and a function to determine if a given string is valid:

const strings = [...];  // array of strings
const enocdedStrings = [];

strings.forEach(str => {
  str = str.padEnd(3, '`');
  let array = enocdedStrings;
  str.split('').forEach((char, i) => {
    // index is 1 indexed
    const index = char.charCodeAt(0) - 96;
    if (i < str.length - 1) {
      array[index] = array[index] || [];
      array = array[index];
    }
    else {
      array[index] = 1;
    }
  });
});

/**
 * Determine if a string is a valid entry in the trie.
 * @param {String} str - String to validate.
 * @return {Boolean}
 */
function isValid(str) {
  let array = enocdedStrings;
  str = str.padEnd(3, '`');

  for (let i = 0; i <= str.length - 1; i++) {
    const index = str.charCodeAt(i) - 96;
    array = array[index];
    if (!array) {
      return false;
    }
  }

  return true;
}