Please add HTML support for http_xslt_module (there's an nginx fork which has it already)

Laurence Rowe l at lrowe.co.uk
Fri Mar 9 23:21:52 UTC 2012


On 9 March 2012 06:20, Peter Halasz <list at pengo.org> wrote:
> Hi devs,
>
> I work for an environmental not-for-profit organisation where we use
> XSLT to theme our website. (The XSLT is generated by Diazo, and the
> site largely runs on Plone).
>
> Currently we use Nginx to do the XSLT transformation. There's a
> problem though, that our un-themed site doesn't come out as perfect
> XML, so we need an XSLT parser which can transform HTML (not just
> XML). Nginx's http_xslt_module does NOT currently support HTML
> parsing, and I'd really like to see this feature added.
>
> The problem isn't specific to Diazo, but the Diazo manual explains the
> need for HTML parsing:
>
>> In theory, any XSLT processor will do. In practice, however, most websites do not produce 100% well-formed XML (i.e. they do not conform to the XHTML “strict” doctype). For this reason, it is normally necessary to use an XSLT processor that will parse the content using a more lenient parser with some knowledge of HTML. libxml2, the most popular XML processing library on Linux and similar operating systems, contains such a parser.
>
> Fortunately there's a fork of nginx which does use libxml2's HTML
> parser: the html-xslt project <http://code.google.com/p/html-xslt/>.
> Unfortunately, the project is not maintained, so it ties us to a
> patched version of nginx 0.7.67 (circa June 2010). I'd like to upgrade
> nginx -- I've hit nginx bugs that were fixed long ago. I'm sure there
> are many other nginx users with the same needs, so I'm requesting the
> fork's changes make their way into the mainline. I'm assuming it's
> just been forgotten.
>
> The Diazo documentation also explains deploying with this patched Nginx:
>
>> To deploy a Diazo theme to the Nginx web server, you will need to compile Nginx with a special version of the XSLT module that can (optionally) use the HTML parser from libxml2.
>
>> In the future, the necessary patches to enable HTML mode parsing will hopefully be part of the standard Nginx distribution. In the meantime, they are maintained in the html-xslt project.
>
> We're using this html-xslt fork of nginx at my organisation. But
> unfortunately, it's not maintained, and the functionality hasn't made
> it into the standard Nginx distribution. Can we please include it?
>
> The fork adds the directive: "xslt_html_parser on;" which causes the
> http_xslt_module to parse in HTML mode.
>
> I've just made a diff <http://pastebin.com/CP1P8Gzj> to see what the
> fork changes, and it's 755 lines long. (That's a bit longer than I
> expected)
>
> The files modified by the html-xslt fork are:
>
>       src/http/modules/ngx_http_xslt_filter_module.c
>       src/http/ngx_http_variables.c
>       auto/options
>       auto/lib/libxslt/conf
>
> The diff is against nginx 0.7.67. Since then the
> ngx_http_xslt_filter_module.c has seen about 300 lines removed and 20
> lines added or changed, so obviously the diff can't be used as a patch
> against the current version of nginx.
>
> Hopefully that's more than enough info to get started if developers
> are interested in folding the fork into nginx.
>
> I know the other solution to our problem here is to move the XSLT to
> another layer of the stack -- such as Varnish or Apache -- but I want
> to make sure nginx devs know about the feature they're missing first.
>
> Thanks for listening and I hope HTML parsing for XSLT can make it to
> the mainline of nginx,

Thanks for bringing this up, I needed a bit of encouragement to get
back to this!

Some of the foundational components of the patch were merged, see:
http://mailman.nginx.org/pipermail/nginx-devel/2010-July/000390.html

I've updated the patch to apply to the current version of nginx:
https://bitbucket.org/lrowe/nginx-xslt-html-parser
It includes the recent xslt_param work from:
http://mailman.nginx.org/pipermail/nginx-devel/2012-March/001926.html
I've left out the autoconf changes supporting custom libxml2/libxslt
builds; they didn't work terribly well anyway.
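
For anyone wanting to try it, a minimal configuration might look like
the sketch below. Only xslt_html_parser comes from the patch;
xslt_stylesheet and xslt_types are the standard module's directives.
The backend address and the stylesheet path are placeholders, not
anything from a real deployment.

```nginx
location / {
    # Hypothetical backend; substitute your own upstream.
    proxy_pass http://127.0.0.1:8080;

    # Directive added by the patch: parse responses in HTML mode.
    xslt_html_parser on;

    # Placeholder path to the compiled Diazo theme.
    xslt_stylesheet /path/to/compiled-theme.xsl;

    # Transform text/html responses in addition to the default text/xml.
    xslt_types text/html;
}
```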

The only significant omissions/changes from a Diazo perspective (when
compared to the nginx 0.7 patch) are:

* The lack of automatic URI unescaping of XSLT parameters. This was
always a hack. If you need the SSI filter_xpath stuff from
http://docs.diazo.org/en/latest/deployment.html#including-external-content-with-ssi
then you will need to unescape the parameter value yourself, using
either inline perl (which didn't build on my Mac) or set_unescape_uri
from https://github.com/agentzh/set-misc-nginx-module.

* As with the current standard nginx xslt module, responses with a
text/xml mime type will always have the stylesheet applied (and will
be parsed with the HTML parser if that is switched on). I'm not sure
what the best way to handle this is without breaking backwards
compatibility. We need something like an `xslt_unregister_types`
directive, but I'm not sure how to go about implementing that. Or
maybe a variable that can be set to disable xslt application.

* HTML documents are no longer assumed to always have a utf-8
charset. You should ensure the charset is set on the response if the
HTML is utf-8 encoded; otherwise the encoding is left to the
autodetection, which normally assumes Latin-1.
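
For the first point, unescaping with set-misc-nginx-module might look
roughly like the sketch below. set_unescape_uri is that module's
directive. The location, the $filter_xpath variable name, the
stylesheet path, and passing the value as a string parameter named
"xpath" are all assumptions for illustration; check them against your
own filter stylesheet.

```nginx
location /include {
    # Hypothetical variable name; $arg_filter_xpath is the raw,
    # still-escaped value of the filter_xpath query argument.
    set_unescape_uri $filter_xpath $arg_filter_xpath;

    xslt_html_parser on;

    # Placeholder path to the filter stylesheet.
    xslt_stylesheet /path/to/filter.xsl;

    # Assumes the stylesheet expects a string parameter called "xpath".
    xslt_string_param xpath $filter_xpath;
}
```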

Anyway, I'd appreciate some testing and, assuming it works, I'll submit
the patches to this list.

Laurence


