Defines the html2xhtml() function, which converts HTML to valid XHTML.

It uses Jason Stitt's pytidylib module. This module requires the HTML Tidy library to be installed on the system:

$ sudo apt-get install tidy
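
Whether the shared library is visible can be checked from Python before importing pytidylib. The following is a minimal sketch using only the standard library. tidy_library_available() is a hypothetical helper, not part of this module or of pytidylib, and pytidylib may look for the library under other names as well.

```python
import ctypes.util


def tidy_library_available():
    """Return True if the system can locate a shared library named 'tidy'.

    Hypothetical helper for illustration; pytidylib performs its own lookup.
    """
    # find_library() returns a name or path, or None when nothing is found.
    return ctypes.util.find_library("tidy") is not None


if __name__ == "__main__":
    if tidy_library_available():
        print("HTML Tidy library found")
    else:
        print("HTML Tidy library not found")
```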

Some examples:

>>> print(html2xhtml('''\
... <p>Hello,&nbsp;world!<br>Again I say: Hello,&nbsp;world!</p>
... <img src="" alt="Foo">'''))
<p>Hello,&nbsp;world!<br />
Again I say: Hello,&nbsp;world!</p>
<img src="" alt="Foo" />

The above test is currently skipped because the tidylib output can differ slightly (alt="Foo"> versus alt="Foo" >) depending on the installed version of tidylib.
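
One possible way to make such a comparison independent of the tidylib version is to normalize the whitespace in front of a tag's closing bracket on both sides before comparing. The sketch below is only an assumption about how this could be done. normalize_tag_endings() is a hypothetical helper and is not used by this module.

```python
import re


def normalize_tag_endings(markup):
    """Remove whitespace that directly precedes '>' or '/>'.

    Hypothetical helper: it also touches a literal '>' in text content that
    follows whitespace, so apply it to both sides of a comparison.
    """
    return re.sub(r"\s+(/?>)", r"\1", markup)


if __name__ == "__main__":
    a = '<img src="" alt="Foo">'
    b = '<img src="" alt="Foo" >'
    print(normalize_tag_endings(a) == normalize_tag_endings(b))
```

Doctest's NORMALIZE_WHITESPACE option would not help here, because it only treats runs of whitespace as equal to each other, not as equal to no whitespace at all.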

>>> html = '''\
... <p style="font-family: &quot;Verdana&quot;;">Verdana</p>'''
>>> print(html2xhtml(html))
<p style="font-family: &quot;Verdana&quot;;">Verdana</p>
>>> print(html2xhtml('A &amp; B'))
A &amp; B
>>> print(html2xhtml('a &lt; b'))
a &lt; b

A <div> inside a <span> is not valid XHTML. Neither is a <li> inside a <strong>.

But how to convert it? Inline tags must be "temporarily" closed before a block element and reopened after it.

>>> print(html2xhtml('<p>foo<span class="c">bar<div> oops </div>baz</span>bam</p>'))
<p>foo<span class="c">bar</span></p>
<div><span class="c">oops</span></div>
<span class="c">baz</span>bam
>>> print(html2xhtml('''<strong><ul><em><li>Foo</li></em><li>Bar</li></ul></strong>'''))
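
The idea of closing inline tags around a block element can be illustrated without tidylib. The following sketch is not how HTML Tidy or this module works internally; it is a simplified illustration built on the standard library's HTMLParser. The names InlineSplitter, split_inline, INLINE_TAGS and BLOCK_TAGS are hypothetical, the tag sets are deliberately incomplete, and comments and doctype declarations are dropped.

```python
from html.parser import HTMLParser

# Illustrative tag sets only; real HTML has many more elements.
INLINE_TAGS = {"span", "strong", "em", "b", "i", "a"}
BLOCK_TAGS = {"div", "p", "ul", "ol", "li"}


class InlineSplitter(HTMLParser):
    """Close open inline tags around every block tag and reopen them after."""

    def __init__(self):
        # Keep entity references as they are instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.open_inline = []  # stack of (tag name, raw start tag text)

    def _close_inline(self):
        # Close the innermost inline tag first.
        for tag, _raw in reversed(self.open_inline):
            self.out.append("</%s>" % tag)

    def _reopen_inline(self):
        # Reopen in the original order, with the original attributes.
        for _tag, raw in self.open_inline:
            self.out.append(raw)

    def _emit_block_tag(self, text):
        self._close_inline()
        self.out.append(text)
        self._reopen_inline()

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if tag in BLOCK_TAGS and self.open_inline:
            self._emit_block_tag(raw)
            return
        if tag in INLINE_TAGS:
            self.open_inline.append((tag, raw))
        self.out.append(raw)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        end = "</%s>" % tag
        if tag in BLOCK_TAGS and self.open_inline:
            self._emit_block_tag(end)
            return
        if self.open_inline and self.open_inline[-1][0] == tag:
            self.open_inline.pop()
        self.out.append(end)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def split_inline(html):
    parser = InlineSplitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(split_inline('<span class="c">bar<div> oops </div>baz</span>'))
```

Unlike tidylib, this sketch neither strips whitespace nor inserts line breaks, and it does not handle a block element nested inside a <p>.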

HTML tolerates leaving certain tags unclosed. For example, the string "<p>foo<p>bar<p>baz" is converted to "<p>foo</p><p>bar</p><p>baz</p>".

>>> print(html2xhtml('<p>foo<p>bar<p>baz'))
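
The implicit closing of paragraphs can be sketched in a few lines of plain Python. This is an illustration of the rule only, under the assumption that <p> is the only tag of interest. ParagraphCloser and close_paragraphs are hypothetical names, and tidylib's real output may differ in whitespace and line breaks.

```python
from html.parser import HTMLParser


class ParagraphCloser(HTMLParser):
    """Insert the </p> that HTML allows authors to leave out."""

    def __init__(self):
        # Keep entity references as they are instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.p_open:
                # A new paragraph implicitly ends the previous one.
                self.out.append("</p>")
            self.p_open = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                return  # ignore a stray </p>
            self.p_open = False
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def close_paragraphs(html):
    parser = ParagraphCloser()
    parser.feed(html)
    parser.close()
    # Close a paragraph that is still open at the end of the input.
    tail = "</p>" if parser.p_open else ""
    return "".join(parser.out) + tail


if __name__ == "__main__":
    print(close_paragraphs("<p>foo<p>bar<p>baz"))
```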


html2xhtml(html, **options)
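
A wrapper with this signature could be built on pytidylib's tidy_fragment() function. The sketch below is an assumption, not this module's actual source. The default option values, the underscore-to-hyphen conversion and the names DEFAULT_OPTIONS, build_tidy_options and html2xhtml_sketch are all hypothetical. The option names themselves are HTML Tidy configuration options.

```python
# Hypothetical defaults; the real module may use different Tidy options.
DEFAULT_OPTIONS = {
    "output-xhtml": 1,   # ask Tidy for XHTML output
    "doctype": "omit",   # do not emit a DOCTYPE declaration
    "show-warnings": 0,  # report errors only
}


def build_tidy_options(**options):
    """Merge caller options into the defaults.

    Keyword names use underscores (show_warnings), Tidy uses hyphens
    (show-warnings), so the names are converted here.
    """
    merged = dict(DEFAULT_OPTIONS)
    for name, value in options.items():
        merged[name.replace("_", "-")] = value
    return merged


def html2xhtml_sketch(html, **options):
    """Convert an HTML fragment with tidylib (illustrative sketch only)."""
    # Imported here so the module loads even where pytidylib is missing.
    import tidylib

    # tidy_fragment() returns the cleaned fragment and an error report.
    fragment, _errors = tidylib.tidy_fragment(
        html, options=build_tidy_options(**options))
    return fragment.strip()
```

With such a design, a call like html2xhtml_sketch(html, indent=0) would pass an additional Tidy option through without changing the defaults.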