CiceroMark OOXML transform #247

dselman · 2020-07-07T10:42:56Z

The existing DOCX support is partial and produces poor-quality results for many real-world DOCX files. It would be preferable to have a first-class bidirectional transformation between CiceroMark and OOXML.

Preferred solution

Integrate an OOXML <-> CiceroMark transform into the project.

Alternatives

We currently use a 3rd-party library to do DOCX -> Markdown transformation, which has a number of issues.

See: #144

Additional context

Accord Project Schemas:

Mapping Table

Document -> w:document
Paragraph -> w:p
Text -> w:t
- ?? -> w:tab
- ?? -> w:noBreakHyphen
- ?? -> w:softHyphen
Linebreak -> w:br
Softbreak -> w:cr (?)
List -> w:numbering
ListItem -> w:num
Strong -> w:b
Emph -> w:i
Variable -> w:sdt (content control)
Heading -> ?? (infer from style?)
Link -> w:hyperlink

dselman · 2020-09-21T08:18:26Z

@DianaLease @irmerk what is the status of this please? Is there something I can do?

jolanglinais · 2020-09-21T14:40:20Z

The work for supporting this transform is captured in the algoo-ooxml branch.

@algomaster99 are you able to update on this?

algomaster99 · 2020-09-22T15:29:22Z

The algoo-ooxml branch currently contains only the OOXML -> CiceroMark transformer, and so far it has only been perfected for [email protected].

Currently parsed entities

It transforms the following OOXML entities into CiceroMark:

There are two types of w:p: one is a heading, the other is an actual paragraph. Which one it is depends on the w:pStyle element inside w:pPr (for example, w:val="Heading2").

  <w:p w:rsidR="009D4C12" w:rsidRDefault="009D4C12">
    <w:pPr>
      <w:pStyle w:val="Heading2"/>
    </w:pPr>
    <w:r>
      <w:rPr>
        <w:sz w:val="40"/>
      </w:rPr>
      <w:t>Acceptance of Delivery.</w:t>
    </w:r>
  </w:p>

to

{
  "$class": "org.accordproject.commonmark.Heading",
  "level": "2",
  "nodes": [
    {
      "$class": "org.accordproject.commonmark.Text",
      "text": "Acceptance of Delivery."
    }
  ]
},
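
To make the heading/paragraph decision concrete, here is a minimal sketch in Python (not the code on the branch). It assumes heading styles are named "Heading1" to "Heading9" and that everything else is a plain paragraph; paragraph_to_node and SAMPLE are hypothetical names used only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_to_node(p):
    """Convert one w:p element into a Heading or Paragraph node (illustrative only)."""
    # Join the text of every w:t run inside the paragraph.
    text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
    nodes = [{"$class": "org.accordproject.commonmark.Text", "text": text}]

    # The style name lives in the w:val attribute of w:pPr/w:pStyle.
    style = p.find("w:pPr/w:pStyle", NS)
    style_name = style.get(f"{{{W_NS}}}val", "") if style is not None else ""

    # Assumption: heading styles are named "Heading<digit>".
    match = re.fullmatch(r"Heading(\d)", style_name)
    if match:
        return {
            "$class": "org.accordproject.commonmark.Heading",
            "level": match.group(1),
            "nodes": nodes,
        }
    return {"$class": "org.accordproject.commonmark.Paragraph", "nodes": nodes}


SAMPLE = f"""<w:p xmlns:w="{W_NS}">
  <w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
  <w:r><w:rPr><w:sz w:val="40"/></w:rPr><w:t>Acceptance of Delivery.</w:t></w:r>
</w:p>"""

heading = paragraph_to_node(ET.fromstring(SAMPLE))
```

Run formatting such as w:sz is ignored in this sketch, since the Heading node in the example above carries only a level and text.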

Variable

<w:sdt>
  <w:sdtPr>
    <w:rPr>
      <w:color w:val="000000"/>
      <w:sz w:val="24"/>
      <w:highlight w:val="green"/>
    </w:rPr>
    <w:alias w:val="Shipper1 | org.accordproject.organization.Organization"/>
    <w:tag w:val="shipper"/>
    <w:id w:val="1083948321"/>
    <w15:webExtensionLinked/>
  </w:sdtPr>
  <w:sdtContent>
    <w:r>
      <w:rPr>
        <w:color w:val="000000"/>
        <w:sz w:val="24"/>
        <w:highlight w:val="green"/>
      </w:rPr>
      <w:t>"Party A"</w:t>
    </w:r>
  </w:sdtContent>
</w:sdt>

to this

{
  "$class": "org.accordproject.ciceromark.Variable",
  "value": "\"Party A\"",
  "name": "shipper",
  "elementType": "org.accordproject.organization.Organization"
},
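
The same mapping can be sketched in Python. This is an illustration rather than the branch's implementation: it assumes the w:alias value always has the form "title | elementType" (inferred from the single example above) and that w:tag holds the variable name. sdt_to_variable and SAMPLE are hypothetical names.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
VAL = f"{{{W_NS}}}val"  # fully qualified name of the w:val attribute


def sdt_to_variable(sdt):
    """Convert one w:sdt content control into a Variable node (illustrative only)."""
    alias = sdt.find("w:sdtPr/w:alias", NS).get(VAL)
    name = sdt.find("w:sdtPr/w:tag", NS).get(VAL)

    # Assumption: the alias is "<title> | <elementType>"; keep the part after "|".
    _, _, element_type = alias.partition("|")

    # The variable value is the concatenated run text under w:sdtContent.
    content = sdt.find("w:sdtContent", NS)
    value = "".join(t.text or "" for t in content.iterfind(".//w:t", NS))

    return {
        "$class": "org.accordproject.ciceromark.Variable",
        "value": value,
        "name": name,
        "elementType": element_type.strip(),
    }


SAMPLE = f"""<w:sdt xmlns:w="{W_NS}">
  <w:sdtPr>
    <w:alias w:val="Shipper1 | org.accordproject.organization.Organization"/>
    <w:tag w:val="shipper"/>
    <w:id w:val="1083948321"/>
  </w:sdtPr>
  <w:sdtContent>
    <w:r><w:t>"Party A"</w:t></w:r>
  </w:sdtContent>
</w:sdt>"""

variable = sdt_to_variable(ET.fromstring(SAMPLE))
```

The sample drops the run properties and the w15:webExtensionLinked marker for brevity; neither affects the resulting Variable node in this sketch.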

Other supported entities include org.accordproject.commonmark.Text and org.accordproject.commonmark.Softbreak. Refer to the cases here to understand how the OOXML is processed.

What is the input to the parser?

This function initiates the transformation of OOXML -> CiceroMark. The OOXML is very long, and we only need the content under the pkg:part element whose pkg:name is "/word/document.xml". This is where all the content of the document resides.
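
As a rough illustration of that first step, the sketch below picks the /word/document.xml part out of a Flat OPC package in Python. It assumes the input is a single pkg:package string with the document wrapped in pkg:xmlData; find_document_part and SAMPLE are hypothetical names, not the function referred to above.

```python
import xml.etree.ElementTree as ET

# Namespaces for the Flat OPC wrapper and for WordprocessingML.
PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_document_part(flat_opc_xml):
    """Return the w:document element of the main document part, or None."""
    root = ET.fromstring(flat_opc_xml)
    for part in root.iter(f"{{{PKG_NS}}}part"):
        # Every other part (styles, rels, settings, ...) is skipped.
        if part.get(f"{{{PKG_NS}}}name") == "/word/document.xml":
            return part.find(f"{{{PKG_NS}}}xmlData/{{{W_NS}}}document")
    return None


SAMPLE = f"""<pkg:package xmlns:pkg="{PKG_NS}">
  <pkg:part pkg:name="/_rels/.rels"><pkg:xmlData/></pkg:part>
  <pkg:part pkg:name="/word/document.xml">
    <pkg:xmlData>
      <w:document xmlns:w="{W_NS}">
        <w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>
      </w:document>
    </pkg:xmlData>
  </pkg:part>
</pkg:package>"""

document = find_document_part(SAMPLE)
paragraphs = document.findall(f"{{{W_NS}}}body/{{{W_NS}}}p")
```

The paragraphs found this way are what a transformer would then walk, one w:p or w:sdt at a time.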

Test by running the test suite. The OOXML it processes is fetched from the document and converted to a CiceroMark representation.