How I developed the fastest XML parser
It was almost 5 years ago that I read the documentation of React.js. The docs were full of JSX and its JS equivalent. I saw that this could work, but I thought: people have been using template engines such as Jade and EJS for a long time, and every Express.js tutorial showed how to use templates. Like React, they also take some input and generate the same UI output every time. Dropping the HTML string into the DOM, however, is not as elegant as React's DOM reconciliation process. That process is needed to keep the state of elements and components, such as user input, or to animate transitions of elements.
This made me work on a framework I called treeact, a framework to do the reconciliation for HTML strings. Experimenting with the browser's built-in parser, I found that parsing was fast enough, but working with the resulting objects from the DOM Node class hierarchy was not. That parser is also not available within a web worker. It was a big idea at the time: bring the app logic to a web worker and only do UI updates in the main thread. (That idea never really caught on, and Angular, React/Redux, Vue, and Svelte apps all run their JS in the main thread.)
Searching for parsers that could run in the browser, I did not find anything fast enough. Sax and xml2js were just too slow for a mobile browser to guarantee a smooth user experience.
So I developed my own parser and made it a separate project. Analyzing pages like GitHub and StackOverflow, which have a very busy UI with lots of elements, I figured I needed to parse 3000 elements in 20 ms on mobile, which at that time meant my Galaxy S3.
It took a lot of experimenting to get the fastest possible results, finding shortcuts and the fastest ways to work with strings.
What makes TXML so fast?
Before I describe that in detail, you need to know what is being compared. txml is a DOM parser. It parses the XML string and creates a tree structure of objects that represent the content of the XML. Parsing the same content always gives equal results. Different parsers return slightly different DOM representations, but all share this requirement. txml returns arrays of strings and node objects, where each node has tagName, attributes, and children.
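To make that shape concrete, here is a small sketch. The input and the tree are made up for illustration, and details such as whitespace handling may differ from what txml actually returns. The countNodes function is a hypothetical helper, not part of txml.

```javascript
// Made-up input for illustration.
const xml = '<user id="7"><name>Ada</name>is online</user>';

// The kind of tree described above: an array of strings and node objects,
// where each node has tagName, attributes, and children.
// (Assumed shape for this input, not captured txml output.)
const tree = [
  {
    tagName: 'user',
    attributes: { id: '7' },
    children: [
      { tagName: 'name', attributes: {}, children: ['Ada'] },
      'is online',
    ],
  },
];

// Walk the tree and count the element nodes. Strings are text content,
// everything else is a node with a children array.
function countNodes(children) {
  let count = 0;
  for (const child of children) {
    if (typeof child === 'string') continue;
    count += 1 + countNodes(child.children);
  }
  return count;
}
```

Because text is a plain string and every node carries the same three properties, code that consumes the tree only needs a single typeof check.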
- Something I found quickly is that regular expressions are too slow. Your code can become very small, but it will not be fast. I found an incredibly small XML parser within the alasql module (not published on npm separately). So if every byte counts, this is maybe the smallest XML parser.
- It was clear to me that I cannot slice and substring too much, because that constantly allocates new memory. The solution was to use a position pointer, the index within the source string.
- Keep the number of function calls down during the parsing process, because every function call creates a scope object. There is a function to parse a node with its attributes, a function for name identifiers, a function for strings, and a function for a node list. I experimented with inlining everything into a single function, but the code got too unreadable with repetitions and the gain was too small. Somewhere I read that Firefox's SpiderMonkey is better optimized for loops and V8 more for function calls. I was able to see that SpiderMonkey profits more from inlining functions, but still too little.
- .indexOf is your friend. With indexOf you can go through a string very quickly. You constantly need to find the next opening and closing brackets. In V8 it runs as very quick native code.
- Only parse the parts of interest. txml was made for fast UI rendering, not to test whether the developer follows the XML specification. For example, closing tags are almost skipped. They begin with </ and end with >. Do you like to write other crap into the closing tag? I don't care. Most XML is generated by machines and will be well formatted. And if that is important to you, you most likely also want to validate against the XML schema. That is out of scope for txml.
- Use .charCodeAt() and compare the result to a number. That is much faster than comparing two single-character strings.
- Do the parsing inside its own JS scope, to keep the scope for the actual parsing small and the needed code near where it is called. This allowed me to add more features without compromising and without making the library slower and slower over time.
- Monomorphism. This is a trick used by Vue.js, Angular, and React alike. It means the created nodes always have the same shape: they always have tagName, attributes, and children. The V8 JS engine can then apply huge performance optimizations, and your code can be cleaner because you don't need a condition to check whether a property is there.
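Several of these points fit into one small sketch. This is not txml's source code, just a minimal illustration written for this article. It assumes well-formed input with double-quoted attributes and does not handle comments, CDATA, or processing instructions. The names parseSketch, parseChildren, and parseNode are made up.

```javascript
// Character codes, compared as numbers instead of one-character strings.
const LT = 60;    // '<'
const GT = 62;    // '>'
const SLASH = 47; // '/'
const SPACE = 32; // ' '

function parseSketch(source) {
  // The position pointer: an index into `source`. The source string itself
  // is never cut into smaller working copies.
  let pos = 0;

  function parseChildren() {
    const children = [];
    while (pos < source.length) {
      if (source.charCodeAt(pos) === LT) {
        if (source.charCodeAt(pos + 1) === SLASH) {
          // Closing tag: its content is not inspected, only skipped.
          const close = source.indexOf('>', pos);
          pos = close === -1 ? source.length : close + 1;
          return children;
        }
        children.push(parseNode());
      } else {
        // Text: jump straight to the next '<' with the native indexOf.
        let next = source.indexOf('<', pos);
        if (next === -1) next = source.length;
        const text = source.slice(pos, next);
        if (text.trim().length > 0) children.push(text);
        pos = next;
      }
    }
    return children;
  }

  function parseNode() {
    pos++; // skip '<'
    const nameStart = pos;
    while (pos < source.length) {
      const c = source.charCodeAt(pos);
      if (c === GT || c === SLASH || c === SPACE) break;
      pos++;
    }
    const tagName = source.slice(nameStart, pos);
    const attributes = {};
    while (pos < source.length) {
      const c = source.charCodeAt(pos);
      if (c === GT || c === SLASH) break;
      if (c === SPACE) { pos++; continue; }
      // Assumption: every attribute looks like name="value".
      const eq = source.indexOf('=', pos);
      const end = eq === -1 ? -1 : source.indexOf('"', eq + 2);
      if (end === -1) throw new Error('unsupported attribute at ' + pos);
      attributes[source.slice(pos, eq)] = source.slice(eq + 2, end);
      pos = end + 1;
    }
    let children;
    if (source.charCodeAt(pos) === SLASH) {
      pos += 2; // self-closing tag, skip '/>'
      children = [];
    } else {
      pos++; // skip '>'
      children = parseChildren();
    }
    // Monomorphic: every node has the same three properties, in the same order.
    return { tagName, attributes, children };
  }

  return parseChildren();
}

const parsed = parseSketch('<list kind="todo"><item done="yes">write</item><item/>tail</list>');
```

The only slices taken are the strings that end up in the result (tag names, attribute values, text). Everything else is done by moving pos forward, and the two inner functions share pos through their closure, so it never has to be passed around or returned.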
These optimizations make txml faster than all other JavaScript XML parsers at the pure parsing process: 10-15 times faster than xml2js, and still 2-3 times faster than fast-xml-parser. These numbers are the results of the benchmarks that are part of fast-xml-parser and camaro.
But there is one more trick that makes it possible to find information inside a document even 100 times faster:
- Only parse the elements that are of interest. The user can provide a start position, where the content that should be parsed is located inside a bigger document. The module also has helpers for getElementById and getElementsByClassName. They can be called directly on the string, without parsing the entire document first.
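The idea behind such a lookup can be sketched with plain string searches. The function below is a hypothetical helper written for this article, not txml's actual implementation or API. It assumes double-quoted id attributes and that neither the searched id text nor stray < characters appear in text content or other attribute values.

```javascript
// Hypothetical helper: return the index where the element with the given
// id starts, or -1 if there is none. Nothing is parsed along the way.
function findElementStartById(source, id) {
  const needle = 'id="' + id + '"';
  let hit = source.indexOf(needle);
  while (hit !== -1) {
    // Accept the hit only if it is a whole attribute name, that is,
    // preceded by whitespace, so that data-id="..." does not match.
    if (hit > 0 && source.charCodeAt(hit - 1) <= 32) {
      // The element starts at the closest '<' before the attribute.
      return source.lastIndexOf('<', hit);
    }
    hit = source.indexOf(needle, hit + 1);
  }
  return -1;
}

const doc = '<root><a id="x1">one</a><b data-id="x2">no</b><c class="k" id="x2">two</c></root>';
const start = findElementStartById(doc, 'x2');
const openingTag = doc.slice(start, doc.indexOf('>', start) + 1);
```

The returned index is the kind of start position a parser can continue from, so only the one element of interest has to be turned into objects.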
While writing this article, I got to know camaro, and even needed to delay the article to analyze the module and run the benchmark. camaro is fast because its C++ code is compiled to WASM (however, it is not as fast as txml and has no support for streams). camaro is also fast due to its use of piscina. Piscina can run processing-intensive tasks inside a worker thread, potentially in parallel. txml can also profit from piscina, but it should be applied by the user of txml, to reduce the amount of data that needs to be transferred between threads.
While developing txml, I learned a lot about performance in JavaScript, and that parsing data using just JavaScript, without regex or other libraries, is not that hard, and the resulting code is pretty fast.
By the way, txml is not only fast, but also reliable and secure. txml only parses the XML; it does not interpret any information, load external resources, or execute external commands during the process. txml has been used to parse data from OpenStreetMap's planet file, ARML for augmented reality, GeoTIFF, RSS, WebDAV, SOAP and other API responses, HTML, SVG, web scraping, and more.
One last thing: whenever you have the chance, please choose JSON over XML. It is just so much simpler.