An extremely fast streaming SAX parser for Node.js
Posted on June 1, 2020 • 3 minutes • 483 words
TLDR: I wrote a SAX parser for Node.js. It’s available here on GitHub: https://github.com/tuananh/sax-parser
I get asked about complete XML parsing with camaro from time to time, and I haven’t managed to find the time to implement it yet. Initially I thought it should be part of the camaro project, but now I think it makes more sense as a separate package.
The package is still in alpha and should not be used in production, but if you want to try it, it’s available on npm as `@tuananh/sax-parser`.
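You can install it from npm:

npm install @tuananh/sax-parser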
Benchmark
The initial benchmark looks pretty good. I just extracted the benchmark script from the node-expat repo and added a few more contenders (a simplified sketch of the script follows the results):
sax x 14,277 ops/sec ±0.73% (87 runs sampled)
@tuananh/sax-parser x 45,779 ops/sec ±0.85% (85 runs sampled)
node-xml x 4,335 ops/sec ±0.51% (86 runs sampled)
node-expat x 13,028 ops/sec ±0.39% (88 runs sampled)
ltx x 81,722 ops/sec ±0.73% (89 runs sampled)
libxmljs x 8,927 ops/sec ±1.02% (88 runs sampled)
Fastest is ltx
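For reference, the benchmark has roughly this shape. This is a simplified sketch with only two contenders and a made-up fixture path; the actual script in the repo covers all six modules and its setup differs:

const Benchmark = require('benchmark')
const { readFileSync } = require('fs')
const sax = require('sax')
const SaxParser = require('@tuananh/sax-parser')

// test.xml path is an assumption for this sketch; the repo keeps its fixture under benchmark/
const xml = readFileSync(__dirname + '/test.xml', 'utf-8')

new Benchmark.Suite()
    .add('sax', () => {
        sax.parser(true).write(xml).close() // strict sax parser, synchronous parse
    })
    .add('@tuananh/sax-parser', () => {
        new SaxParser().parse(xml)
    })
    .on('cycle', (event) => console.log(String(event.target)))
    .on('complete', function () {
        console.log('Fastest is ' + this.filter('fastest').map('name'))
    })
    .run()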
The ltx package is the fastest, winning by almost 2x (~1.8x) over the second fastest (@tuananh/sax-parser). However, ltx is not fully compliant with the XML spec; I still include it here for reference. If ltx works for you, use it.
module | ops/sec | native | XML compliant | stream
---|---|---|---|---
node-xml | 4,335 | ❌ | ✅ | ✅
libxmljs | 8,927 | ✅ | ✅ | ❌
node-expat | 13,028 | ✅ | ✅ | ✅
sax | 14,277 | ❌ | ✅ | ✅
@tuananh/sax-parser | 45,779 | ✅ | ✅ | ✅
ltx | 81,722 | ❌ | ❌ | ✅
API
The API looks simple enough and should feel familiar if you have used other SAX parsers. In fact, I took inspiration from sax and node-expat and mostly copied their APIs to make the transition easier.
An example of using @tuananh/sax-parser to prettify XML would look like this:
const { readFileSync } = require('fs')
const SaxParser = require('@tuananh/sax-parser')
const parser = new SaxParser()
let depth = 0
parser.on('startElement', (name) => {
let str = ''
for (let i = 0; i < depth; ++i) str += ' ' // indentation
str += `<${name}>`
process.stdout.write(str + '\n')
depth++
})
parser.on('text', (text) => {
let str = ''
for (let i = 0; i < depth; ++i) str += ' ' // indentation, same level as child elements
str += text
process.stdout.write(str + '\n')
})
parser.on('endElement', (name) => {
depth--
let str = ''
for (let i = 0; i < depth; ++i) str += ' ' // indentation
str += `</${name}>` // closing tag
process.stdout.write(str + '\n')
})
parser.on('startAttribute', (name, value) => {
// console.log('startAttribute', name, value)
})
parser.on('endAttribute', () => {
// console.log('endAttribute')
})
parser.on('cdata', (cdata) => {
let str = ''
for (let i = 0; i < depth; ++i) str += ' ' // indentation, same level as child elements
str += `<![CDATA[${cdata}]]>`
process.stdout.write(str)
process.stdout.write('\n')
})
parser.on('comment', (comment) => {
process.stdout.write(`<!--${comment}-->\n`)
})
parser.on('doctype', (doctype) => {
process.stdout.write(`<!DOCTYPE ${doctype}>\n`)
})
parser.on('startDocument', () => {
process.stdout.write(`<!--=== START ===-->\n`)
})
parser.on('endDocument', () => {
process.stdout.write(`<!--=== END ===-->`)
})
const xml = readFileSync(__dirname + '/../benchmark/test.xml', 'utf-8')
parser.parse(xml)
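Beyond pretty-printing, the same events cover the typical streaming use case of pulling a few values out of a document without building a tree. Here is a minimal sketch that uses only the events shown above; the title element and the inline sample document are made up for illustration:

const SaxParser = require('@tuananh/sax-parser')

const parser = new SaxParser()
let elementCount = 0
let inTitle = false
const titles = []

parser.on('startElement', (name) => {
    elementCount++
    if (name === 'title') inTitle = true
})

parser.on('text', (text) => {
    if (inTitle) titles.push(text) // collect text only while inside a title element
})

parser.on('endElement', (name) => {
    if (name === 'title') inTitle = false
})

parser.on('endDocument', () => {
    console.log(`parsed ${elementCount} elements, found titles:`, titles)
})

parser.parse('<books><book><title>SAX in a nutshell</title></book></books>')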