HTML parsing support for xslt module

Wed Mar 21 12:55:24 UTC 2012

Hello!

On Tue, Mar 20, 2012 at 04:05:42PM +0000, Laurence Rowe wrote:

> Hi Maxim,
> 
> Thanks for your review. I've made a few comments / requests for
> clarification below and will pull them together into a new patch set
> over the week or so.
> 
> On 19 March 2012 17:03, Maxim Dounin <mdounin at mdounin.ru> wrote:
> > Hello!
> >
> > On Wed, Mar 14, 2012 at 05:03:56PM +0000, Laurence Rowe wrote:
> >
> >> I'd like to submit the attached patch which implements an
> >> ``xslt_html_parser`` directive for consideration. When enabled, the
> >> xslt module uses the libxml2 HTMLParser to parse the response body.
> >> This is useful for people who want to transform HTML using XSLT,
> >> including users of Diazo deploying on Nginx
> >> (http://docs.diazo.org/en/latest/deployment.html#nginx).
> >>
> >> The patch is generated from my repository at
> >> https://bitbucket.org/lrowe/nginx-xslt-html-parser, forked from
> >> http://mdounin.ru/hg/nginx-vendor-current/. The xslt_param patch from
> >> http://mailman.nginx.org/pipermail/nginx-devel/2012-March/001926.html
> >> is included in my repository. I'll discuss each of the individual
> >> changesets briefly below:
> >>
> >> changeset:   668:bf4d14f51436
> >> user:        Laurence Rowe <laurence at lrowe.co.uk>
> >> date:        Sun Jul 11 23:54:08 2010 +0100
> >> summary:     Skip transform when there is no content (e.g. a proxied redirect)
> >>
> >> This was originally reviewed in
> >> http://mailman.nginx.org/pipermail/nginx-devel/2010-July/000390.html:
> >>
> >> > Currently the way to disable XSLT processing is MIME type, "text/xml"
> >> > by default. Redirects usually have "text/html" type.
> >>
> >> To parse HTML you have to set ``xslt_types text/html;``. This
> >> changeset prevents crashing on responses with an empty body.
> >
> > I don't think that skipping transformation based on
> > content_length_n is a correct behaviour.
> >
> > 1) There are more than one way to represent empty response, and
> > content_length_n == 0 is just one of them.
> >
> > 2) And the empty response is just another case of malformed input,
> > I see no reason to handle this specially.
> 
> Is there a good example of a filter module doing correct error
> handling to use here, perhaps ngx_http_sub_filter_module.c?

The ngx_http_image_filter_module.c is closest one to xslt (both 
transform full response and may return error instead).  Though I 
think error handling in xslt as of now is good enough.

> >> changeset:   669:9487b3a0e3ff
> >> user:        Laurence Rowe <laurence at lrowe.co.uk>
> >> date:        Fri Mar 09 21:01:25 2012 +0000
> >> summary:     Use xmlCtxtUseOptions to set options.
> >>
> >> This changeset moves most option setting to the xmlCtxtUseOptions
> >> (foundational to next commit.)
> >
> > This doesn't produce identical ctxt->loadsubset, and hence need
> > more details.  I tend to think the current code is wrong and
> > your's is ok, though it would be good to see confirmation.
> 
> In the current version of libxml2, loadsubset is a bit field. The
> previous value of 1 does not match any of the loadsubset bit field
> options (XML_DETECT_IDS=2, XML_COMPLETE_ATTRS=4, XML_SKIP_IDS=8) but
> from reading the source, XML_DETECT_IDS is not tested for directly in
> any relevant files, with SAX2.c using the test (ctxt->loadsubset !=
> 0). The loadsubset value is set to XML_DETECT_IDS by the
> xmlCtxtUseOptions call with XML_PARSE_DTDLOAD.
> 
> I guess it must have been a change sometime in the distant past to
> make it a bit field, but it was definitely a bit field as far back as
> LIBXML2_2_6_21 from 2005.

Ok, my investigations match.

> > (Additionally, there is a style issue in this patch.)
> 
> Is there a style guide for nginx code?

Not really.  We probably have to write one, at least with some basics.

> I couldn't find an example of
> how a long bitfield should be split across lines.

See e.g. in ngx_http_core_module.c:

    { ngx_string("limit_rate"),
      NGX_HTTP_MAIN_CONF|NGX_HTTP_SRV_CONF|NGX_HTTP_LOC_CONF|NGX_HTTP_LIF_CONF
                        |NGX_CONF_TAKE1,
      ngx_conf_set_size_slot,
      NGX_HTTP_LOC_CONF_OFFSET,
      offsetof(ngx_http_core_loc_conf_t, limit_rate),
      NULL },

In extreme cases short form is used, like this (from 
ngx_http_upstream.c):

    uscf = ngx_http_upstream_add(cf, &u, NGX_HTTP_UPSTREAM_CREATE
                                         |NGX_HTTP_UPSTREAM_WEIGHT
                                         |NGX_HTTP_UPSTREAM_MAX_FAILS
                                         |NGX_HTTP_UPSTREAM_FAIL_TIMEOUT
                                         |NGX_HTTP_UPSTREAM_DOWN
                                         |NGX_HTTP_UPSTREAM_BACKUP);

Though usually it's better to avoid the split.

> >> changeset:   670:c8349ca87381
> >> user:        Laurence Rowe <laurence at lrowe.co.uk>
> >> date:        Fri Mar 09 21:08:18 2012 +0000
> >> summary:     XML_PARSE_COMPACT to save memory
> >>
> >> We can use XML_PARSE_COMPACT as the parsed input document is not
> >> modified (XSLT creates a new result document.)
> >
> > Is it supported in libxml2 versions used by various OSes now?  It
> > appeared relatively recently, in libxml2 2.6.21, and if some OSes
> > still use older versions by default - this may need configure
> > test.
> >
> > Another question to consider - is it actually safe to use?  Note
> > that nginx does modify input document to provide DTDs.
> >
> > It would be also cool to see some stats on real world memory usage
> > with and without this option set.
> 
> The best benchmark I could find was here:
> http://comments.gmane.org/gmane.comp.python.lxml.devel/1507.
> Essentially it is helpful for documents with lots of short text nodes.

It looks like performance benchmark, and I doubt the speedup will 
be compareable with nginx given the fact it uses nginx pool 
allocator which is expected to be faster than normal malloc().

> I hadn't noticed the document modification to provide DTDs. That makes
> it unsafe to use and it should be dropped.

Ok.

> >> changeset:   671:7145bd8cc1e2
> >> tag:         tip
> >> user:        Laurence Rowe <laurence at lrowe.co.uk>
> >> date:        Fri Mar 09 21:47:46 2012 +0000
> >> summary:     xslt_html_parser
> >>
> >> This changeset adds the ``xslt_html_parser`` directive and uses the
> >> HTMLParser when it is set. HTML parsing is performed with
> >> HTML_PARSE_RECOVER as real-world HTML may not be well formed, error
> >> handling is thus disabled when this option is set.
> >
> > I in general like the feature, though I tend to think this patch
> > needs more polishing.  In no particular order:
> >
> > 1) Hardcoded "utf-8" charset detection scares me.
> 
> Looks like xmlParseCharEncoding is the way to go here.
> http://apache.webthing.com/mod_proxy_html30/mod_proxy_html-2.5.2.c

Yes, probably.

> > 2) The are no error handling for html code, and while it's
> > probably ok to ignore parsing problems - it's certainly not ok to
> > ignore fatal problems like memory allocation ones.
> 
> I'll add back the fatal error handler and add an off switch to it.

Not really understand.  Why do you think it should be possible to 
switch off handling of memory allocation errors?

> > 3) DTD and entities handling needs clarification.
> 
> Anything other to the XML_PARSE_COMPACT concern noted above? The DTD
> is ignored by the HTMLParser, so I do not set
> ctxt->sax->externalSubset when parsing html.

My concerns were mostly about "what will happen if DTD referenced in 
a document".  It's ok as long as it's not used anyway.

Maxim Dounin