lino.utils.html2odf

This module mainly provides the utility function html2odf(), which converts an ElementTree object generated using etgen.html into a fragment of ODF.

This is not trivial, because HTML and ODF are quite different document representations, but something like this seems necessary. Lino uses it to generate .odt documents that contain, among other things, chunks of HTML that were entered using TinyMCE and stored in database fields.

TODO: is there really no existing library for this task? I saw approaches that call LibreOffice in headless mode to do the conversion, but that seems inappropriate for our situation, where we must glue together fragments from different sources. Also note that we use appy.pod to do the actual generation.

Usage examples:

>>> from etgen.html import E, tostring
>>> def test(e):
...     print (tostring(e))
...     print (toxml(html2odf(e)))
>>> test(E.p("This is a ", E.b("first"), " test."))
... 
<p>This is a <b>first</b> test.</p>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">This
is a <text:span text:style-name="Strong Emphasis">first</text:span>
test.</text:p>
>>> test(E.p(E.b("This")," is another test."))
... 
<p><b>This</b> is another test.</p>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"><text:span
text:style-name="Strong Emphasis">This</text:span> is another test.</text:p>
>>> test(E.p(E.strong("This")," is another test."))
... 
<p><strong>This</strong> is another test.</p>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"><text:span 
text:style-name="Strong Emphasis">This</text:span> is another test.</text:p>
>>> test(E.p(E.i("This")," is another test."))
... 
<p><i>This</i> is another test.</p>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"><text:span
text:style-name="Emphasis">This</text:span> is another test.</text:p>
>>> test(E.td(E.p("This is another test.")))
... 
<td><p>This is another test.</p></td>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">This
is another test.</text:p>
>>> test(E.td(E.p(E.b("This"), " is another test.")))
... 
<td><p><b>This</b> is another test.</p></td>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"><text:span
text:style-name="Strong Emphasis">This</text:span> is another test.</text:p>
>>> test(E.ul(E.li("First item"),E.li("Second item")))
... 
<ul><li>First item</li><li>Second item</li></ul>
<text:list xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" 
text:style-name="podBulletedList"><text:list-item><text:p 
text:style-name="podBulletItem">First item</text:p></text:list-item><text:list-item><text:p 
text:style-name="podBulletItem">Second item</text:p></text:list-item></text:list>

N.B.: the above chunk is obviously not correct, since Writer doesn't display it. (How can I debug a generated .odt file? I mean if my content.xml is syntactically valid but Writer ...) Idea: validate it against the ODF specification using lxml.
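As a first debugging step, the generated content.xml can be inspected directly, because an .odt file is a zip archive. The following is only a minimal sketch using the standard library; the helper name `list_text_elements` is made up for this illustration and is not part of Lino or appy.pod. It checks well-formedness only; validating against the RELAX NG schema published with the ODF specification (for instance with lxml's `etree.RelaxNG`) would be a separate step.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def list_text_elements(odt_file):
    """Return the local names of all text:* elements found in content.xml.

    `odt_file` may be a path or a file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(odt_file) as zf:
        # ET.fromstring raises ParseError if content.xml is not well-formed.
        root = ET.fromstring(zf.read("content.xml"))
    prefix = "{%s}" % TEXT_NS
    return [el.tag[len(prefix):] for el in root.iter()
            if el.tag.startswith(prefix)]


# Demo with an in-memory archive standing in for a real .odt file.
content = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="%s">'
    '<text:list><text:list-item><text:p>First item</text:p>'
    '</text:list-item></text:list>'
    '</office:document-content>' % TEXT_NS
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("content.xml", content)
buf.seek(0)
print(list_text_elements(buf))  # ['list', 'list-item', 'p']
```

Comparing this listing with the one from a document that Writer does display may help to narrow down which element or attribute is the problem.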

Here is another HTML fragment which doesn't yield a valid result:

>>> from lxml import etree
>>> html = '<td><div><p><b>Bold</b></p></div></td>'
>>> print(toxml(html2odf(etree.fromstring(html))))
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"/>

html2odf() converts bold text to a span with a style named "Strong Emphasis". That's currently a hard-coded name, and the caller must make sure that a style of that name is defined in the document.

The inline elements <i> and <em> are converted to a span with the style "Emphasis".
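The tag-to-style mapping described above can be sketched with the standard library alone. This is not the module's implementation: the names `INLINE_STYLES`, `q` and `inline_to_odf` are assumptions made for this illustration, and only a flat paragraph with inline children is handled.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("text", TEXT_NS)

# Hard-coded style names as documented above; the target document
# must define styles with these names.
INLINE_STYLES = {
    "b": "Strong Emphasis",
    "strong": "Strong Emphasis",
    "i": "Emphasis",
    "em": "Emphasis",
}


def q(name):
    """Return `name` qualified with the ODF text namespace."""
    return "{%s}%s" % (TEXT_NS, name)


def inline_to_odf(html_p):
    """Convert a flat <p> element to a text:p element (illustrative sketch)."""
    odf_p = ET.Element(q("p"))
    odf_p.text = html_p.text
    for child in html_p:
        style = INLINE_STYLES.get(child.tag)
        if style is None:
            raise NotImplementedError("<%s> inside <text:p>" % child.tag)
        span = ET.SubElement(odf_p, q("span"), {q("style-name"): style})
        span.text = child.text
        span.tail = child.tail  # text following the inline element
    return odf_p


p = inline_to_odf(ET.fromstring("<p>This is a <b>first</b> test.</p>"))
print(p[0].get(q("style-name")))  # Strong Emphasis
```

Keeping the style names in one dictionary makes the hard-coded dependency on the document's styles visible in a single place.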

Edge case: a plain string instead of an element is wrapped in a paragraph:

>>> print (toxml(html2odf("Plain string")))
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">Plain string</text:p>
>>> print (toxml(html2odf(u"Ein schöner Text")))
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">Ein schöner Text</text:p>

Not yet supported

The following is an example for #788. Conversion fails if a sequence of paragraph-level items is grouped using a div:

>>> test(E.div(E.p("Two numbered items:"),
...    E.ol(E.li("first"), E.li("second"))))
... 
Traceback (most recent call last):
...
IllegalText: The <text:section> element does not allow text
>>> from lxml import etree
>>> test(etree.fromstring('<ul type="disc"><li>First</li><li>Second</li></ul>'))
<ul type="disc"><li>First</li><li>Second</li></ul>
<text:list xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" text:style-name="podBulletedList"><text:list-item><text:p text:style-name="podBulletItem">First</text:p></text:list-item><text:list-item><text:p text:style-name="podBulletItem">Second</text:p></text:list-item></text:list>
>>> test(E.p(E.dl(E.dt("Foo"), E.dl("A foobar without bar."))))
Traceback (most recent call last):
...
NotImplementedError: <dl> inside <text:p>


Functions

html2odf(e[, ct])

Convert an etgen.html element to an ODF text element.

toxml(node)

Convert an ODF node to a string with its XML representation.

lino.utils.html2odf.toxml(node)

Convert an ODF node to a string with its XML representation.

lino.utils.html2odf.html2odf(e, ct=None, **ctargs)

Convert an etgen.html element to an ODF text element. Most formats are not implemented. There's probably a better way to do this...

ct: the root element ("container"). If not specified, we create one.
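The optional-container pattern behind the ct parameter can be illustrated as follows. This is only a sketch under assumptions: `append_text` is a hypothetical stand-in written with the standard library, not the real html2odf(), and it ignores the `**ctargs` keyword arguments, whose meaning is not documented here.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def q(name):
    """Return `name` qualified with the ODF text namespace."""
    return "{%s}%s" % (TEXT_NS, name)


def append_text(s, ct=None):
    """Append the string `s` to the container `ct`.

    If no container is given, a new text:p element is created,
    mirroring the documented behaviour of the `ct` parameter.
    """
    if ct is None:
        ct = ET.Element(q("p"))
    if len(ct):
        # Text after the last child element goes into that child's tail.
        last = ct[-1]
        last.tail = (last.tail or "") + s
    else:
        ct.text = (ct.text or "") + s
    return ct


p = append_text("Plain ")       # no container given: one is created
same = append_text("string", p)  # existing container is reused
print(same is p, p.text)         # True Plain string
```

Passing an existing container lets a caller glue several converted fragments into one paragraph, while omitting it gives a fresh element per call.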