HTML parsing support for xslt module

Tue Mar 20 16:05:42 UTC 2012

Hi Maxim,

Thanks for your review. I've made a few comments / requests for
clarification below and will pull them together into a new patch set
over the week or so.

On 19 March 2012 17:03, Maxim Dounin <mdounin at mdounin.ru> wrote:
> Hello!
>
> On Wed, Mar 14, 2012 at 05:03:56PM +0000, Laurence Rowe wrote:
>
>> I'd like to submit the attached patch which implements an
>> ``xslt_html_parser`` directive for consideration. When enabled, the
>> xslt module uses the libxml2 HTMLParser to parse the response body.
>> This is useful for people who want to transform HTML using XSLT,
>> including users of Diazo deploying on Nginx
>> (http://docs.diazo.org/en/latest/deployment.html#nginx).
>>
>> The patch is generated from my repository at
>> https://bitbucket.org/lrowe/nginx-xslt-html-parser, forked from
>> http://mdounin.ru/hg/nginx-vendor-current/. The xslt_param patch from
>> http://mailman.nginx.org/pipermail/nginx-devel/2012-March/001926.html
>> is included in my repository. I'll discuss each of the individual
>> changesets briefly below:
>>
>> changeset:   668:bf4d14f51436
>> user:        Laurence Rowe <laurence at lrowe.co.uk>
>> date:        Sun Jul 11 23:54:08 2010 +0100
>> summary:     Skip transform when there is no content (e.g. a proxied redirect)
>>
>> This was originally reviewed in
>> http://mailman.nginx.org/pipermail/nginx-devel/2010-July/000390.html:
>>
>> > Currently the way to disable XSLT processing is MIME type, "text/xml"
>> > by default. Redirects usually have "text/html" type.
>>
>> To parse HTML you have to set ``xslt_types text/html;``. This
>> changeset prevents crashing on responses with an empty body.
>
> I don't think that skipping transformation based on
> content_length_n is a correct behaviour.
>
> 1) There are more than one way to represent empty response, and
> content_length_n == 0 is just one of them.
>
> 2) And the empty response is just another case of malformed input,
> I see no reason to handle this specially.

Is there a good example of a filter module doing correct error
handling to use here, perhaps ngx_http_sub_filter_module.c?

>> changeset:   669:9487b3a0e3ff
>> user:        Laurence Rowe <laurence at lrowe.co.uk>
>> date:        Fri Mar 09 21:01:25 2012 +0000
>> summary:     Use xmlCtxtUseOptions to set options.
>>
>> This changeset moves most option setting to the xmlCtxtUseOptions
>> (foundational to next commit.)
>
> This doesn't produce identical ctxt->loadsubset, and hence need
> more details.  I tend to think the current code is wrong and
> your's is ok, though it would be good to see confirmation.

In the current version of libxml2, loadsubset is a bit field. The
previous value of 1 does not match any of the loadsubset bit field
options (XML_DETECT_IDS=2, XML_COMPLETE_ATTRS=4, XML_SKIP_IDS=8) but
from reading the source, XML_DETECT_IDS is not tested for directly in
any relevant files, with SAX2.c using the test (ctxt->loadsubset !=
0). The loadsubset value is set to XML_DETECT_IDS by the
xmlCtxtUseOptions call with XML_PARSE_DTDLOAD.

I guess it must have been a change sometime in the distant past to
make it a bit field, but it was definitely a bit field as far back as
LIBXML2_2_6_21 from 2005.

> (Additionally, there is a style issue in this patch.)

Is there a style guide for nginx code? I couldn't find an example of
how a long bitfield should be split across lines.

>> changeset:   670:c8349ca87381
>> user:        Laurence Rowe <laurence at lrowe.co.uk>
>> date:        Fri Mar 09 21:08:18 2012 +0000
>> summary:     XML_PARSE_COMPACT to save memory
>>
>> We can use XML_PARSE_COMPACT as the parsed input document is not
>> modified (XSLT creates a new result document.)
>
> Is it supported in libxml2 versions used by various OSes now?  It
> appeared relatively recently, in libxml2 2.6.21, and if some OSes
> still use older versions by default - this may need configure
> test.
>
> Another question to consider - is it actually safe to use?  Note
> that nginx does modify input document to provide DTDs.
>
> It would be also cool to see some stats on real world memory usage
> with and without this option set.

The best benchmark I could find was here:
http://comments.gmane.org/gmane.comp.python.lxml.devel/1507.
Essentially it is helpful for documents with lots of short text nodes.

I hadn't noticed the document modification to provide DTDs. That makes
it unsafe to use and it should be dropped.

>> changeset:   671:7145bd8cc1e2
>> tag:         tip
>> user:        Laurence Rowe <laurence at lrowe.co.uk>
>> date:        Fri Mar 09 21:47:46 2012 +0000
>> summary:     xslt_html_parser
>>
>> This changeset adds the ``xslt_html_parser`` directive and uses the
>> HTMLParser when it is set. HTML parsing is performed with
>> HTML_PARSE_RECOVER as real-world HTML may not be well formed, error
>> handling is thus disabled when this option is set.
>
> I in general like the feature, though I tend to think this patch
> needs more polishing.  In no particular order:
>
> 1) Hardcoded "utf-8" charset detection scares me.

Looks like xmlParseCharEncoding is the way to go here.
http://apache.webthing.com/mod_proxy_html30/mod_proxy_html-2.5.2.c

> 2) The are no error handling for html code, and while it's
> probably ok to ignore parsing problems - it's certainly not ok to
> ignore fatal problems like memory allocation ones.

I'll add back the fatal error handler and add an off switch to it.

> 3) DTD and entities handling needs clarification.

Anything other to the XML_PARSE_COMPACT concern noted above? The DTD
is ignored by the HTMLParser, so I do not set
ctxt->sax->externalSubset when parsing html.

> Maxim Dounin
>
> p.s. You may want to use patchbomb mercurial extension (hg email)
> to submit patches.

I'll experiment with this for the next version of the patch.

Laurence