HTML parsing support for xslt module

Mon Mar 19 17:03:27 UTC 2012

Hello!

On Wed, Mar 14, 2012 at 05:03:56PM +0000, Laurence Rowe wrote:

> I'd like to submit the attached patch which implements an
> ``xslt_html_parser`` directive for consideration. When enabled, the
> xslt module uses the libxml2 HTMLParser to parse the response body.
> This is useful for people who want to transform HTML using XSLT,
> including users of Diazo deploying on Nginx
> (http://docs.diazo.org/en/latest/deployment.html#nginx).
> 
> The patch is generated from my repository at
> https://bitbucket.org/lrowe/nginx-xslt-html-parser, forked from
> http://mdounin.ru/hg/nginx-vendor-current/. The xslt_param patch from
> http://mailman.nginx.org/pipermail/nginx-devel/2012-March/001926.html
> is included in my repository. I'll discuss each of the individual
> changesets briefly below:
> 
> changeset:   668:bf4d14f51436
> user:        Laurence Rowe <laurence at lrowe.co.uk>
> date:        Sun Jul 11 23:54:08 2010 +0100
> summary:     Skip transform when there is no content (e.g. a proxied redirect)
> 
> This was originally reviewed in
> http://mailman.nginx.org/pipermail/nginx-devel/2010-July/000390.html:
> 
> > Currently the way to disable XSLT processing is MIME type, "text/xml"
> > by default. Redirects usually have "text/html" type.
> 
> To parse HTML you have to set ``xslt_types text/html;``. This
> changeset prevents crashing on responses with an empty body.

I don't think that skipping transformation based on 
content_length_n is a correct behaviour.

1) There are more than one way to represent empty response, and 
content_length_n == 0 is just one of them.

2) And the empty response is just another case of malformed input, 
I see no reason to handle this specially.

> changeset:   669:9487b3a0e3ff
> user:        Laurence Rowe <laurence at lrowe.co.uk>
> date:        Fri Mar 09 21:01:25 2012 +0000
> summary:     Use xmlCtxtUseOptions to set options.
> 
> This changeset moves most option setting to the xmlCtxtUseOptions
> (foundational to next commit.)

This doesn't produce identical ctxt->loadsubset, and hence need 
more details.  I tend to think the current code is wrong and 
your's is ok, though it would be good to see confirmation.

(Additionally, there is a style issue in this patch.)

> changeset:   670:c8349ca87381
> user:        Laurence Rowe <laurence at lrowe.co.uk>
> date:        Fri Mar 09 21:08:18 2012 +0000
> summary:     XML_PARSE_COMPACT to save memory
> 
> We can use XML_PARSE_COMPACT as the parsed input document is not
> modified (XSLT creates a new result document.)

Is it supported in libxml2 versions used by various OSes now?  It 
appeared relatively recently, in libxml2 2.6.21, and if some OSes 
still use older versions by default - this may need configure 
test.

Another question to consider - is it actually safe to use?  Note 
that nginx does modify input document to provide DTDs.

It would be also cool to see some stats on real world memory usage 
with and without this option set.

> changeset:   671:7145bd8cc1e2
> tag:         tip
> user:        Laurence Rowe <laurence at lrowe.co.uk>
> date:        Fri Mar 09 21:47:46 2012 +0000
> summary:     xslt_html_parser
> 
> This changeset adds the ``xslt_html_parser`` directive and uses the
> HTMLParser when it is set. HTML parsing is performed with
> HTML_PARSE_RECOVER as real-world HTML may not be well formed, error
> handling is thus disabled when this option is set.

I in general like the feature, though I tend to think this patch 
needs more polishing.  In no particular order:

1) Hardcoded "utf-8" charset detection scares me.

2) The are no error handling for html code, and while it's 
probably ok to ignore parsing problems - it's certainly not ok to 
ignore fatal problems like memory allocation ones.  

3) DTD and entities handling needs clarification.

Maxim Dounin

p.s. You may want to use patchbomb mercurial extension (hg email) 
to submit patches.