HTML parsing support for xslt module
Laurence Rowe
l at lrowe.co.uk
Tue Mar 20 16:05:42 UTC 2012
Hi Maxim,
Thanks for your review. I've made a few comments / requests for
clarification below and will pull them together into a new patch set
over the week or so.
On 19 March 2012 17:03, Maxim Dounin <mdounin at mdounin.ru> wrote:
> Hello!
>
> On Wed, Mar 14, 2012 at 05:03:56PM +0000, Laurence Rowe wrote:
>
>> I'd like to submit the attached patch which implements an
>> ``xslt_html_parser`` directive for consideration. When enabled, the
>> xslt module uses the libxml2 HTMLParser to parse the response body.
>> This is useful for people who want to transform HTML using XSLT,
>> including users of Diazo deploying on Nginx
>> (http://docs.diazo.org/en/latest/deployment.html#nginx).
>>
>> The patch is generated from my repository at
>> https://bitbucket.org/lrowe/nginx-xslt-html-parser, forked from
>> http://mdounin.ru/hg/nginx-vendor-current/. The xslt_param patch from
>> http://mailman.nginx.org/pipermail/nginx-devel/2012-March/001926.html
>> is included in my repository. I'll discuss each of the individual
>> changesets briefly below:
>>
>> changeset: 668:bf4d14f51436
>> user: Laurence Rowe <laurence at lrowe.co.uk>
>> date: Sun Jul 11 23:54:08 2010 +0100
>> summary: Skip transform when there is no content (e.g. a proxied redirect)
>>
>> This was originally reviewed in
>> http://mailman.nginx.org/pipermail/nginx-devel/2010-July/000390.html:
>>
>> > Currently the way to disable XSLT processing is MIME type, "text/xml"
>> > by default. Redirects usually have "text/html" type.
>>
>> To parse HTML you have to set ``xslt_types text/html;``. This
>> changeset prevents crashing on responses with an empty body.
>
> I don't think that skipping transformation based on
> content_length_n is a correct behaviour.
>
> 1) There are more than one way to represent empty response, and
> content_length_n == 0 is just one of them.
>
> 2) And the empty response is just another case of malformed input,
> I see no reason to handle this specially.
Is there a good example of a filter module doing correct error
handling to use here, perhaps ngx_http_sub_filter_module.c?
>> changeset: 669:9487b3a0e3ff
>> user: Laurence Rowe <laurence at lrowe.co.uk>
>> date: Fri Mar 09 21:01:25 2012 +0000
>> summary: Use xmlCtxtUseOptions to set options.
>>
>> This changeset moves most option setting to the xmlCtxtUseOptions
>> (foundational to next commit.)
>
> This doesn't produce identical ctxt->loadsubset, and hence need
> more details. I tend to think the current code is wrong and
> your's is ok, though it would be good to see confirmation.
In the current version of libxml2, loadsubset is a bit field. The
previous value of 1 does not match any of the loadsubset bit field
options (XML_DETECT_IDS=2, XML_COMPLETE_ATTRS=4, XML_SKIP_IDS=8) but
from reading the source, XML_DETECT_IDS is not tested for directly in
any relevant files, with SAX2.c using the test (ctxt->loadsubset !=
0). The loadsubset value is set to XML_DETECT_IDS by the
xmlCtxtUseOptions call with XML_PARSE_DTDLOAD.
I guess it must have been a change sometime in the distant past to
make it a bit field, but it was definitely a bit field as far back as
LIBXML2_2_6_21 from 2005.
> (Additionally, there is a style issue in this patch.)
Is there a style guide for nginx code? I couldn't find an example of
how a long bitfield should be split across lines.
>> changeset: 670:c8349ca87381
>> user: Laurence Rowe <laurence at lrowe.co.uk>
>> date: Fri Mar 09 21:08:18 2012 +0000
>> summary: XML_PARSE_COMPACT to save memory
>>
>> We can use XML_PARSE_COMPACT as the parsed input document is not
>> modified (XSLT creates a new result document.)
>
> Is it supported in libxml2 versions used by various OSes now? It
> appeared relatively recently, in libxml2 2.6.21, and if some OSes
> still use older versions by default - this may need configure
> test.
>
> Another question to consider - is it actually safe to use? Note
> that nginx does modify input document to provide DTDs.
>
> It would be also cool to see some stats on real world memory usage
> with and without this option set.
The best benchmark I could find was here:
http://comments.gmane.org/gmane.comp.python.lxml.devel/1507.
Essentially it is helpful for documents with lots of short text nodes.
I hadn't noticed the document modification to provide DTDs. That makes
it unsafe to use and it should be dropped.
>> changeset: 671:7145bd8cc1e2
>> tag: tip
>> user: Laurence Rowe <laurence at lrowe.co.uk>
>> date: Fri Mar 09 21:47:46 2012 +0000
>> summary: xslt_html_parser
>>
>> This changeset adds the ``xslt_html_parser`` directive and uses the
>> HTMLParser when it is set. HTML parsing is performed with
>> HTML_PARSE_RECOVER as real-world HTML may not be well formed, error
>> handling is thus disabled when this option is set.
>
> I in general like the feature, though I tend to think this patch
> needs more polishing. In no particular order:
>
> 1) Hardcoded "utf-8" charset detection scares me.
Looks like xmlParseCharEncoding is the way to go here.
http://apache.webthing.com/mod_proxy_html30/mod_proxy_html-2.5.2.c
> 2) The are no error handling for html code, and while it's
> probably ok to ignore parsing problems - it's certainly not ok to
> ignore fatal problems like memory allocation ones.
I'll add back the fatal error handler and add an off switch to it.
> 3) DTD and entities handling needs clarification.
Anything other to the XML_PARSE_COMPACT concern noted above? The DTD
is ignored by the HTMLParser, so I do not set
ctxt->sax->externalSubset when parsing html.
> Maxim Dounin
>
> p.s. You may want to use patchbomb mercurial extension (hg email)
> to submit patches.
I'll experiment with this for the next version of the patch.
Laurence
More information about the nginx-devel
mailing list