HTML parsing support for xslt module

Laurence Rowe l at lrowe.co.uk
Wed Mar 14 17:03:56 UTC 2012


I'd like to submit the attached patch, which implements an
``xslt_html_parser`` directive, for consideration. When enabled, the
xslt module uses the libxml2 HTMLParser to parse the response body.
This is useful for people who want to transform HTML using XSLT,
including users of Diazo deploying on Nginx
(http://docs.diazo.org/en/latest/deployment.html#nginx).

The patch is generated from my repository at
https://bitbucket.org/lrowe/nginx-xslt-html-parser, forked from
http://mdounin.ru/hg/nginx-vendor-current/. The xslt_param patch from
http://mailman.nginx.org/pipermail/nginx-devel/2012-March/001926.html
is included in my repository. I'll discuss each of the individual
changesets briefly below:

changeset:   668:bf4d14f51436
user:        Laurence Rowe <laurence at lrowe.co.uk>
date:        Sun Jul 11 23:54:08 2010 +0100
summary:     Skip transform when there is no content (e.g. a proxied redirect)

This was originally reviewed in
http://mailman.nginx.org/pipermail/nginx-devel/2010-July/000390.html:

> Currently the way to disable XSLT processing is MIME type, "text/xml"
> by default. Redirects usually have "text/html" type.

To parse HTML you have to set ``xslt_types text/html;``, so redirects
are no longer excluded by MIME type. This changeset prevents a crash
on responses with an empty body.

changeset:   669:9487b3a0e3ff
user:        Laurence Rowe <laurence at lrowe.co.uk>
date:        Fri Mar 09 21:01:25 2012 +0000
summary:     Use xmlCtxtUseOptions to set options.

This changeset moves most parser option setting into a call to
xmlCtxtUseOptions (groundwork for the next changeset).

changeset:   670:c8349ca87381
user:        Laurence Rowe <laurence at lrowe.co.uk>
date:        Fri Mar 09 21:08:18 2012 +0000
summary:     XML_PARSE_COMPACT to save memory

We can use XML_PARSE_COMPACT because the parsed input document is not
modified (XSLT creates a new result document).

changeset:   671:7145bd8cc1e2
tag:         tip
user:        Laurence Rowe <laurence at lrowe.co.uk>
date:        Fri Mar 09 21:47:46 2012 +0000
summary:     xslt_html_parser

This changeset adds the ``xslt_html_parser`` directive and uses the
HTMLParser when it is set. HTML parsing is performed with
HTML_PARSE_RECOVER because real-world HTML is often not well formed;
parser error handling is therefore disabled when the directive is set.
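
As a rough illustration of how the directive might be used together
with ``xslt_types``, here is a minimal configuration sketch. It is not
taken from the patch: the ``on`` argument assumes ``xslt_html_parser``
is a flag-style directive, and the upstream address and stylesheet
path are hypothetical placeholders.

```nginx
location / {
    # Hypothetical upstream serving ordinary (possibly malformed) HTML.
    proxy_pass http://127.0.0.1:8080;

    # Hypothetical stylesheet path, e.g. a compiled Diazo theme.
    xslt_stylesheet /etc/nginx/theme.xsl;

    # The xslt module only processes "text/xml" by default, so HTML
    # responses have to be enabled explicitly.
    xslt_types text/html;

    # Assumed flag syntax: parse the body with the libxml2 HTMLParser
    # instead of the XML parser.
    xslt_html_parser on;
}
```

With a configuration along these lines, a proxied redirect with an
empty body would still match ``text/html``, which is the case the
first changeset guards against.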


Laurence
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xslt_html_parser.diff
Type: application/octet-stream
Size: 9209 bytes
Desc: not available
URL: <http://mailman.nginx.org/pipermail/nginx-devel/attachments/20120314/1f865912/attachment.obj>