CiceroMark OOXML transform #247

Open
dselman opened this issue Jul 7, 2020 · 4 comments
Labels
Difficulty: Medium Good First Issue :octocat: Good for newcomers Help Wanted 🆘 Extra attention is needed Type: Feature Request 🛍️ New feature or request

Comments

@dselman
Contributor

dselman commented Jul 7, 2020

The existing DOCX support is partial and produces poor-quality results on many real-world DOCX files. It would be preferable to have a first-class bidirectional transformation between CiceroMark and OOXML.

Preferred solution

Integrate an OOXML <-> CiceroMark transform into the project.

Alternatives

We currently use a 3rd-party library to do DOCX -> Markdown transformation, which has a number of issues.

See: #144

Additional context

Accord Project Schemas:

Mapping Table

  • Document -> w:document
  • Paragraph -> w:p
  • Text -> w:t
    • ?? -> w:tab
    • ?? -> w:noBreakHyphen
    • ?? -> w:softHyphen
  • Linebreak -> w:br
  • Softbreak -> w:cr (?)
  • List -> w:numbering
  • ListItem -> w:num
  • Strong -> w:b
  • Emph -> w:i
  • Variable -> w:sdt (content control)
  • Heading -> ?? (infer from style?)
  • Link -> w:hyperlink
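
As a rough illustration, the proposed mapping table could be held as a plain lookup that serves both directions. This is only a sketch in Python, not code from the project; the names CICEROMARK_TO_OOXML, OOXML_TO_CICEROMARK and ooxml_tag_for are hypothetical, and the entries the table leaves open ("??") are deliberately left out.

```python
# Proposed CiceroMark -> OOXML mapping, copied from the table in this issue.
# Rows marked "??" in the table are intentionally omitted.
CICEROMARK_TO_OOXML = {
    "Document": "w:document",
    "Paragraph": "w:p",
    "Text": "w:t",
    "Linebreak": "w:br",
    "Softbreak": "w:cr",  # marked "(?)" in the table, so tentative
    "List": "w:numbering",
    "ListItem": "w:num",
    "Strong": "w:b",
    "Emph": "w:i",
    "Variable": "w:sdt",
    "Link": "w:hyperlink",
}

# Reverse lookup for the OOXML -> CiceroMark direction.
OOXML_TO_CICEROMARK = {tag: node for node, tag in CICEROMARK_TO_OOXML.items()}


def ooxml_tag_for(node_type):
    """Return the OOXML tag for a CiceroMark node type, or None if unmapped."""
    return CICEROMARK_TO_OOXML.get(node_type)
```

Returning None for unmapped types (Heading, for example) keeps the open questions in the table visible rather than guessing a tag.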
@dselman
Contributor Author

dselman commented Sep 21, 2020

@DianaLease @irmerk what is the status of this please? Is there something I can do?

@jolanglinais
Member

The work for supporting this transform is captured in the algoo-ooxml branch.

@algomaster99 are you able to update on this?

@algomaster99
Member

The algoo-ooxml branch currently contains only an OOXML -> CiceroMark transformer, and it has only been tuned for [email protected].

Currently parsed entities

It transfers the following OOXML entities into CiceroMark:

  1. There are two types of w:p: a heading and an actual paragraph. Which one it is depends on the w:val of the w:pStyle element inside w:pPr.
      <w:p w:rsidR="009D4C12" w:rsidRDefault="009D4C12">
        <w:pPr>
          <w:pStyle w:val="Heading2"/>
        </w:pPr>
        <w:r>
          <w:rPr>
            <w:sz w:val="40"/>
          </w:rPr>
          <w:t>Acceptance of Delivery.</w:t>
        </w:r>
      </w:p>
    to
    {
    "$class": "org.accordproject.commonmark.Heading",
    "level": "2",
    "nodes": [
      {
        "$class": "org.accordproject.commonmark.Text",
        "text": "Acceptance of Delivery."
      }
    ]
    },
  2. Variable
    <w:sdt>
      <w:sdtPr>
        <w:rPr>
          <w:color w:val="000000"/>
          <w:sz w:val="24"/>
          <w:highlight w:val="green"/>
        </w:rPr>
        <w:alias w:val="Shipper1 | org.accordproject.organization.Organization"/>
        <w:tag w:val="shipper"/>
        <w:id w:val="1083948321"/>
        <w15:webExtensionLinked/>
      </w:sdtPr>
      <w:sdtContent>
        <w:r>
          <w:rPr>
            <w:color w:val="000000"/>
            <w:sz w:val="24"/>
            <w:highlight w:val="green"/>
          </w:rPr>
          <w:t>"Party A"</w:t>
        </w:r>
      </w:sdtContent>
    </w:sdt>
    to this
    {
      "$class": "org.accordproject.ciceromark.Variable",
      "value": "\"Party A\"",
      "name": "shipper",
      "elementType": "org.accordproject.organization.Organization"
    },
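
The heading-versus-paragraph decision in item 1 could be sketched roughly as follows. This is Python using only the standard library, not the project's own transformer; the function name paragraph_to_node is hypothetical, and the sketch assumes heading styles are named "Heading1" to "Heading9" and that all runs can be merged into a single Text node.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_to_node(p):
    """Convert a w:p element into a Heading or Paragraph node (plain dict)."""
    # Simplification: concatenate the text of every w:t into one Text node.
    text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
    nodes = [{"$class": "org.accordproject.commonmark.Text", "text": text}]

    # The style name lives in the w:val attribute of w:pPr/w:pStyle.
    style = p.find("w:pPr/w:pStyle", NS)
    style_name = style.get(f"{{{W_NS}}}val", "") if style is not None else ""

    # Assumption: heading styles look like "Heading2".
    match = re.fullmatch(r"Heading(\d)", style_name)
    if match:
        return {
            "$class": "org.accordproject.commonmark.Heading",
            "level": match.group(1),
            "nodes": nodes,
        }
    return {"$class": "org.accordproject.commonmark.Paragraph", "nodes": nodes}
```

A w:p without a w:pStyle, or with a style that does not match the assumed pattern, falls through to a Paragraph node.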

Other supported entities include org.accordproject.commonmark.Text and org.accordproject.commonmark.Softbreak. Refer to the cases here to understand how it processes the OOXML.
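
The Variable case in item 2 might look something like the sketch below. It is a Python illustration under stated assumptions, not the code in the branch: the function name sdt_to_variable is hypothetical, and it assumes the w:alias value always has the form "Title | elementType" and that w:tag carries the variable name, as in the sample.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
W_VAL = f"{{{W_NS}}}val"


def sdt_to_variable(sdt):
    """Convert a w:sdt content control into a Variable node (plain dict)."""
    alias = sdt.find("w:sdtPr/w:alias", NS)
    tag = sdt.find("w:sdtPr/w:tag", NS)

    # Assumption: the alias is "Title | elementType"; take the part after "|".
    alias_val = alias.get(W_VAL, "") if alias is not None else ""
    _, _, element_type = alias_val.partition("|")

    # The visible value is the text of every w:t under w:sdtContent.
    content = sdt.find("w:sdtContent", NS)
    value = ""
    if content is not None:
        value = "".join(t.text or "" for t in content.iter(f"{{{W_NS}}}t"))

    return {
        "$class": "org.accordproject.ciceromark.Variable",
        "value": value,
        "name": tag.get(W_VAL, "") if tag is not None else "",
        "elementType": element_type.strip(),
    }
```

If the alias has no "|" separator, elementType comes back as an empty string rather than a guessed type.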

What is the input to the parser?

This function initiates the OOXML -> CiceroMark transformation. The full OOXML is very long, and we only need the content under the block that begins with <pkg:part pkg:name="/word/document.xml". This is where all of the document's content resides.
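
To illustrate the extraction step, here is a minimal Python sketch that pulls the w:document element out of the full package XML. It assumes the input is a Flat OPC package (pkg:package root with pkg:part children and the body wrapped in pkg:xmlData); the function name extract_document is hypothetical and this is not the project's own implementation.

```python
import xml.etree.ElementTree as ET

NS = {
    "pkg": "http://schemas.microsoft.com/office/2006/xmlPackage",
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
}
PKG_NAME = f"{{{NS['pkg']}}}name"


def extract_document(flat_opc_xml):
    """Return the w:document element from a Flat OPC string, or None."""
    root = ET.fromstring(flat_opc_xml)
    for part in root.iter(f"{{{NS['pkg']}}}part"):
        # Only the main document part is of interest; skip styles, rels, etc.
        if part.get(PKG_NAME) == "/word/document.xml":
            return part.find("pkg:xmlData/w:document", NS)
    return None
```

The returned element can then be walked paragraph by paragraph; every other part of the package is ignored.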

To test, run the test suite. The OOXML it processes is fetched from the document and converted to a CiceroMark representation.

CiceroMark -> OOXML

This is done directly in the cicero-word-add-in repo. The source code can be found here.

@K-Kumar-01
Collaborator

@dselman @algomaster99
I have created a new issue listing the implemented and remaining transformations. Let me know if there is anything to add.
The issue is linked here.
