lino.utils.html2xhtml

Defines the html2xhtml() function, which converts HTML to valid XHTML.

It uses Jason Stitt's pytidylib module. This module requires the HTML Tidy library to be installed on the system:

$ sudo apt-get install tidy
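
As a rough illustration of how pytidylib can be called, here is a minimal sketch. It is not necessarily how html2xhtml() is implemented. The helper name tidy_to_xhtml is hypothetical, and the sketch assumes that pytidylib's tidy_fragment() function and the Tidy option "output-xhtml" are available in the installed version.

```python
def tidy_to_xhtml(html):
    """Return `html` as an XHTML fragment, or None if tidylib is unavailable.

    Hypothetical helper for illustration only.
    """
    try:
        from tidylib import tidy_fragment
        # tidy_fragment() returns a (fragment, errors) tuple.
        fragment, _errors = tidy_fragment(html, options={"output-xhtml": 1})
    except (ImportError, OSError):
        # Either pytidylib or the underlying HTML Tidy shared library is missing.
        return None
    return fragment


if __name__ == "__main__":
    print(tidy_to_xhtml("<p>Hello,&nbsp;world!<br>Again"))
```

Returning None when the library is missing is a design choice of this sketch only. It keeps the example runnable on systems where HTML Tidy has not been installed.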

Some examples:

>>> print(html2xhtml('''\
... <p>Hello,&nbsp;world!<br>Again I say: Hello,&nbsp;world!</p>
... <img src="foo.org" alt="Foo">'''))
... 
<p>Hello,&nbsp;world!<br />
Again I say: Hello,&nbsp;world!</p>
<img src="foo.org" alt="Foo" />

The above test is currently skipped because the tidylib output can differ slightly (alt="Foo"> versus alt="Foo" >) depending on the installed version of tidylib.

>>> html = '''\
... <p style="font-family: &quot;Verdana&quot;;">Verdana</p>'''
>>> print(html2xhtml(html))
<p style="font-family: &quot;Verdana&quot;;">Verdana</p>
>>> print(html2xhtml('A &amp; B'))
A &amp; B
>>> print(html2xhtml('a &lt; b'))
a &lt; b

A <div> inside a <span> is not valid XHTML. Neither is a <li> inside a <strong>.

But how should such markup be converted? Inline tags must be "temporarily" closed before a block element and reopened after it.

>>> print(html2xhtml('<p>foo<span class="c">bar<div> oops </div>baz</span>bam</p>'))
<p>foo<span class="c">bar</span></p>
<div><span class="c">oops</span></div>
<span class="c">baz</span>bam
>>> print(html2xhtml('''<strong><ul><em><li>Foo</li></em><li>Bar</li></ul></strong>'''))
<ul>
<li><strong><em>Foo</em></strong></li>
<li><strong>Bar</strong></li>
</ul>
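
The close-and-reopen idea shown in the examples above can be sketched with the standard library alone. This is not how HTML Tidy works internally. The names InlineSplitter, split_inline_around_blocks, BLOCK_TAGS and VOID_TAGS are hypothetical, the tag sets are deliberately incomplete, and the sketch neither closes a surrounding <p> nor adds the line breaks that tidylib emits.

```python
from html.parser import HTMLParser

# Assumption: small, incomplete tag sets chosen for illustration only.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "blockquote"}
VOID_TAGS = {"br", "img", "hr"}


class InlineSplitter(HTMLParser):
    """Close open inline tags at block boundaries and reopen them lazily."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.inline = []   # (tag, start-tag text) of logically open inline tags
        self.emitted = 0   # how many entries of self.inline are open in the output

    def _open_pending(self):
        # Physically open every inline tag that is logically open but not yet emitted.
        for _tag, text in self.inline[self.emitted:]:
            self.out.append(text)
        self.emitted = len(self.inline)

    def _close_down_to(self, index):
        # Physically close emitted inline tags from the innermost down to `index`.
        for tag, _text in reversed(self.inline[index:self.emitted]):
            self.out.append("</%s>" % tag)
        self.emitted = min(self.emitted, index)

    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        if tag in BLOCK_TAGS:
            self._close_down_to(0)
            self.out.append(text)
        elif tag in VOID_TAGS:
            self._open_pending()
            self.out.append(text)
        else:
            # Inline tag: remember it, emit it only when content arrives.
            self.inline.append((tag, text))

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._close_down_to(0)
            self.out.append("</%s>" % tag)
            return
        # Find the innermost logically open inline tag with this name, if any.
        for index in range(len(self.inline) - 1, -1, -1):
            if self.inline[index][0] == tag:
                self._close_down_to(index)
                del self.inline[index]
                return

    def handle_data(self, data):
        if data.strip():
            self._open_pending()
        self.out.append(data)

    def handle_entityref(self, name):
        self._open_pending()
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self._open_pending()
        self.out.append("&#%s;" % name)


def split_inline_around_blocks(html):
    parser = InlineSplitter()
    parser.feed(html)
    parser.close()
    parser._close_down_to(0)  # close anything left open at end of input
    return "".join(parser.out)


if __name__ == "__main__":
    print(split_inline_around_blocks(
        "<strong><ul><em><li>Foo</li></em><li>Bar</li></ul></strong>"))
```

Inline tags are reopened only when text actually follows, so the sketch does not produce empty elements such as <strong></strong> between block tags.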

HTML tolerated leaving certain tags unclosed. For example, the string "<p>foo<p>bar<p>baz" is converted to "<p>foo</p><p>bar</p><p>baz</p>".

>>> print(html2xhtml('<p>foo<p>bar<p>baz'))
<p>foo</p>
<p>bar</p>
<p>baz</p>
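
The rule behind this example, that a new <p> implicitly ends the previous one, can be sketched with the standard library's html.parser module. This is only an illustration of the rule, not the algorithm HTML Tidy uses. The names ParagraphCloser and close_paragraphs are hypothetical, and the sketch ignores comments, declarations and every implied end tag other than </p>.

```python
from html.parser import HTMLParser


class ParagraphCloser(HTMLParser):
    """Emit </p> before a new <p> opens and at the end of the input."""

    def __init__(self):
        # Keep entity references such as &nbsp; untouched.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.p_open:
                self.out.append("</p>")  # implied end of the previous paragraph
            self.p_open = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                return  # drop a stray </p>
            self.p_open = False
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def close_paragraphs(html):
    parser = ParagraphCloser()
    parser.feed(html)
    parser.close()
    if parser.p_open:
        parser.out.append("</p>")  # close the last paragraph at end of input
    return "".join(parser.out)


if __name__ == "__main__":
    print(close_paragraphs("<p>foo<p>bar<p>baz"))
```

Unlike the tidylib output shown above, this sketch does not insert line breaks between the paragraphs.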


Functions

html2xhtml(html, **options)