Danger to Nginx from raw unicode in paths?

David xeoncross at gmail.com
Mon Jan 26 01:06:13 UTC 2015

I was recently wondering if I should filter URLs by character, allowing only
what is standard in applications.

Words, numbers, and a couple of characters [.-_/\]. We know that the set of
characters allowed in URLs and domain names is really just a subset of ASCII.

However, I'm not totally sure what nginx does when I pass "µ" to it.

I came up with a simple regular expression to match anything that isn't
one of those:

location ~* "(*UTF8)([^\p{L}\p{N}/\.\-\%\\]+)" {
if ($uri ~* "(*UTF8)([^\p{L}\p{N}/\.\-\%\\]+)") {
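For reference, here is a rough Python analogue of what that character class is meant to allow (a sketch using the stdlib unicodedata module rather than nginx's PCRE engine; the is_suspicious name and the helper itself are my own, not anything nginx provides):

```python
import unicodedata

# Characters allowed in addition to Unicode letters and numbers,
# mirroring the extras in the nginx character class.
ALLOWED_EXTRA = set("/.-%\\")

def is_suspicious(path: str) -> bool:
    """Flag any character that is not a Unicode letter (category L*),
    a Unicode number (category N*), or one of the ALLOWED_EXTRA set —
    a rough stand-in for the PCRE class [^\\p{L}\\p{N}/.\\-%\\\\]."""
    for ch in path:
        if unicodedata.category(ch)[0] in ("L", "N") or ch in ALLOWED_EXTRA:
            continue
        return True
    return False

print(is_suspicious("/files/report.pdf"))  # False
print(is_suspicious("/µ"))                 # False: µ is a letter (category Ll)
print(is_suspicious("/a b"))               # True: space is not allowed
```

Note that under \p{L} a path like /µ is considered clean, which is exactly the behavior the UTF-8 mode question below is about.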

However, I'm wondering if I actually need UTF-8 matching, since clients
should default to percent-encoding the bytes (%20, or \x23 in hex) and the
actual transfer is binary anyway.
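To illustrate what a well-behaved client sends on the wire, here is a small stdlib sketch of that percent-encoding step:

```python
from urllib.parse import quote, unquote

# "µ" is U+00B5; its UTF-8 encoding is the two bytes 0xC2 0xB5, so a
# well-behaved client percent-encodes it before putting it in a request line.
encoded = quote("µ")
print(encoded)           # %C2%B5
print(unquote(encoded))  # µ
```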

Here is an example test where I piped almost all of the ~65,000 Unicode code
points in the Basic Multilingual Plane to nginx via curl:


For example: $ curl -v http://localhost/与
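A minimal sketch of how such a sweep can be generated (the base URL, range, and test_urls helper are my own choices for illustration; the surrogate range is skipped because those code points cannot appear in well-formed UTF-8):

```python
from urllib.parse import quote

def test_urls(base="http://localhost", start=0x20, stop=0x10000):
    """Yield percent-encoded test URLs for a range of BMP code points."""
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:   # surrogates: not valid in UTF-8
            continue
        yield base + "/" + quote(chr(cp))

urls = list(test_urls(stop=0x30))
print(urls[0])   # http://localhost/%20
```

Each of these URLs can then be fed to curl -v (or any HTTP client) to observe how nginx responds.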

Basically, is there any point in watching URLs for non-standard sequences
as a way to catch possible attacks?

( FYI: I posted more details that led to this question here: )
