update txml to version 4.0.1 for parsing office documents

Today I found a new issue for txml, the small, fast, pure JavaScript XML parser. Until now, I saw no value in keeping whitespace when parsing. However, I had to learn otherwise: the issue asked for the spaces within a document to be kept.

To reproduce the problem, I opened Wikipedia and learned that ODF files (such as .odt) as well as .docx files are basically zip files with a defined set of contents: an XML file that defines the content, plus all the media files embedded in the document. Here is the code that I came up with to open and manipulate an ODF file from LibreOffice:

const fs = require("fs");
const JSZip = require("jszip"); // reading the zip file
const txml = require("txml"); // use txml to manipulate the content


// read a zip file into a buffer
fs.readFile("doc.odt", async function(err, data) {
    if (err) throw err;
    // parse the zip data
    const zip = await JSZip.loadAsync(data);
    
    // explore the files with some logging
    const files = Object.keys(zip.files);
    // console.log(zip.files['content.xml'])
    
    // get the content of the document
    const textContent = await zip.file("content.xml").async("string");
    const content = txml.parse(textContent);
    console.log(textContent)
    console.log(files);
    // console.log(JSON.stringify(content[1].children, undefined, '  '))
    const officeText = txml.filter(content, (node) => node.tagName === 'office:text')[0];
    // append a new paragraph with new text
    officeText.children.push({
      "tagName": "text:p",
      "attributes": {
        "text:style-name": "P1"
      },
      "children": [
        "new Text"
      ]
    });

    // stringify the content
    // note: since version 4.0.1, txml closes the `?xml` declaration correctly, so the old
    // workaround `.split('></?xml>').join('?>\n')` is no longer needed
    const newFileContent = txml.stringify(content);
    console.log(newFileContent);

    // Write the content back to the zip file
    zip.file("content.xml", newFileContent);
    const updatedZip = await zip.generateAsync({ type: "nodebuffer" });
    fs.writeFileSync('docUpdated.odt',updatedZip);
});

Then I realized that I do not have Microsoft Office installed, but I do have WordPad, which can also open and save ODF as well as docx files. So I created a docx file with it and did the same:

const fs = require("fs");
const JSZip = require("jszip");
const txml = require("txml");

fs.readFile("wordpad.docx", async function(err, data) {
    if (err) throw err;
    const zip = await JSZip.loadAsync(data);
    
    // console.log(Object.keys(zip.files))
    const textContent = await zip.file("word/document.xml").async("string");
    const content = txml.parse(textContent);
    // console.log(textContent);
    // console.log(JSON.stringify(content,undefined,'  '));
    // since version 4.0.1 the `?xml` declaration needs no manual fix after stringify
    const newContent = txml.stringify(content);
    // overwrite the same entry that was read above
    zip.file("word/document.xml", newContent);
    const updatedZip = await zip.generateAsync({type: "nodebuffer"});
    fs.writeFileSync('wordpad2fromNode.docx',updatedZip);
});

When I stringified the parsed content and wrote it back, the resulting document worked well, but it was a bit bigger. That is because txml always stringifies elements with explicit closing tags (</...>), even where the original file used the shorter self-closing form.
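
To illustrate why the output grows, here is a minimal sketch of a stringifier for txml-style nodes. The function `stringifyNodes` and the sample tree are hypothetical and only mimic the node shape used above (`tagName`, `attributes`, `children`); this is not txml's actual implementation.

```javascript
// Hypothetical, simplified stringifier: every element gets an explicit closing tag.
function stringifyNodes(nodes) {
  return nodes
    .map((node) => {
      // text nodes are plain strings
      if (typeof node === "string") return node;
      const attrs = Object.entries(node.attributes || {})
        .map(([name, value]) => ` ${name}="${value}"`)
        .join("");
      const inner = stringifyNodes(node.children || []);
      return `<${node.tagName}${attrs}>${inner}</${node.tagName}>`;
    })
    .join("");
}

// Made-up input that uses the short self-closing form for an empty element.
const original = "<w:p><w:pPr/><w:r><w:t>Hi</w:t></w:r></w:p>";

// The same document as a node tree, written by hand for this example.
const tree = [
  {
    tagName: "w:p",
    attributes: {},
    children: [
      { tagName: "w:pPr", attributes: {}, children: [] },
      {
        tagName: "w:r",
        attributes: {},
        children: [{ tagName: "w:t", attributes: {}, children: ["Hi"] }],
      },
    ],
  },
];

const rewritten = stringifyNodes(tree);
console.log(rewritten);
console.log(rewritten.length > original.length);
```

The empty element is written as an open and close pair instead of the short form, so a document with many empty elements gets noticeably longer while staying equivalent XML.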

With this test, I was also able to reproduce the problem described in the issue: spaces typed into the document did not end up in the document written by my node program. That is fixed in the new version of txml, which adds a new option, keepWhitespace.

For this I really want to thank moongazers. In the issue, he described the problem clearly and even pointed me to the right position in the code.


Tobias Nickel

software developer from heart
