Danger to Nginx from raw unicode in paths?

Mon Jan 26 01:06:13 UTC 2015

I was recently wondering whether I should filter URLs by character, to
allow only what is standard in applications:

Letters, numbers, and a few other characters [.-_/\]. We know the set of
characters supported in URLs and domains is really just a subset of ASCII
<http://perishablepress.com/stop-using-unsafe-characters-in-urls/>.

However, I'm not totally sure what nginx does when I pass "µ" to it.

I came up with a simple regular expression to match something that isn't
one of those:

location ~* "(*UTF8)([^\p{L}\p{N}/\.\-\%\\\]+)" {
if ($uri ~* "(*UTF8)([^\p{L}\p{N}/\.\-\%\\\]+)" ) {
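
To make it easier to see what that character class accepts and rejects, here is a rough Python sketch of the same test. It is not how nginx or PCRE evaluates the pattern: the function name `disallowed_chars` and the constant `EXTRA_ALLOWED` are made up for illustration, and it assumes that \p{L} and \p{N} correspond to the Unicode general categories starting with "L" and "N".

```python
import unicodedata

# Characters permitted in addition to letters and numbers, mirroring the
# class in the pattern above: / . - % and backslash. (Hypothetical name.)
EXTRA_ALLOWED = set("/.-%\\")


def disallowed_chars(path):
    """Return the characters in `path` that the negated class would match."""
    bad = []
    for ch in path:
        category = unicodedata.category(ch)  # e.g. "Ll", "Lo", "Nd", "Zs"
        is_letter_or_number = category[0] in ("L", "N")
        if not is_letter_or_number and ch not in EXTRA_ALLOWED:
            bad.append(ch)
    return bad


if __name__ == "__main__":
    for sample in ["/foo/bar-1.html", "/µ", "/与", "/a b<c", "/a_b"]:
        print(repr(sample), disallowed_chars(sample))
```

Under these assumptions "µ" and "与" pass, because both are Unicode letters, while a space or "<" is reported. "_" is reported too, since the class does not list it.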

However, I'm wondering if I actually need to use the UTF-8 matching since
clients should default to URL encoding (%20) or hex encoding (\x23) the
bytes and the actual transfer should be binary anyway.
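
As a small illustration of the URL-encoding case, this Python sketch shows what a non-ASCII path looks like once a client percent-encodes its UTF-8 bytes. It only uses the standard library's `urllib.parse`; the helper name `percent_encode_path` is hypothetical, and whether a given client actually encodes the path this way before sending it is an assumption, not something verified here.

```python
from urllib.parse import quote, unquote


def percent_encode_path(path):
    """Percent-encode the UTF-8 bytes of `path`, leaving "/" untouched."""
    return quote(path, safe="/")


if __name__ == "__main__":
    for raw in ["/µ", "/与"]:
        encoded = percent_encode_path(raw)
        # Decoding the encoded form gives back the original string.
        print(raw, "->", encoded, "->", unquote(encoded))
```

If the request arrives in this form, the request line itself contains only ASCII, and the non-ASCII character only reappears after decoding.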

Here is an example test where I piped almost all 65,000 Unicode code points
to nginx via curl:

https://gist.github.com/Xeoncross/acca3f09c5aeddac8c9f

For example: $ curl -v http://localhost/与

Basically, is there any point in watching URLs for non-standard sequences
when looking for possible attacks?

( FYI: I posted more details that led to this question here:
http://stackoverflow.com/questions/28055909/does-nginx-support-raw-unicode-in-paths
 )