[PATCH] HTTP: Add new uri_normalization_percent_decode option

Tue Apr 4 03:26:06 UTC 2023

Hello!

On Mon, Apr 03, 2023 at 06:33:21PM +0000, Michael Kourlas via nginx-devel wrote:

> > Consider the following rewrite:
> >
> > rewrite ^/(.*) /$1 break;
> >
> > Assuming request to "GET /foo%2fbar/", what
> > $uri_encoded_percent_and_reserved do you expect after each of
> > these rewrites?
> 
> I do not think that rewrite does anything in practice. Following the rewrite,
> I would expect $uri to remain unchanged at its current value of "/foo/bar/" and
> $uri_encoded_percent_and_reserved to similarly remain unchanged at its current
> value of "/foo%2fbar/".

How do you expect this to be detected by the code?  Direct 
comparison of the new URI as generated by the rewrite with the old 
URI?

Bonus question: what happens with the following rewrite:

rewrite ^/(.*) /bazz/$1 break;

It is basically equivalent to the above except it certainly 
changes the URI.

> > Similarly, consider the following rewrite:
> >
> > rewrite ^/foo/(.*) /$1 break;
> >
> > What $uri_encoded_percent_and_reserved is expected after the
> > rewrite?
> 
> In this case the regular expression matches $uri but not
> $uri_encoded_percent_and_reserved. One could say that this just means that only
> $uri is updated, but that has the potential to cause confusion when a flag is
> used that changes the control flow (unless the user explicitly opts into this
> behaviour).
> 
> This could be addressed by adding a "match-source" optional argument to
> "rewrite" with three values (and a default of "uri"):
> * "uri" - rewrite directive matches and changes $uri only
> * "uri_encoded_percent_and_reserved" - rewrite directive matches and changes
>   $uri_encoded_percent_and_reserved only
> * "all" - rewrite directive matches and changes both (if only one is matched,
>   directive is not applied)
> 
> It might also be a good idea to add "uri_encoded_percent_and_reserved_regex"
> and "uri_encoded_percent_and_reserved_replacement" arguments to be used with
> "all", so that it is possible to use the same directive and flag even when
> needing to perform slightly different rewrites for $uri versus
> $uri_encoded_percent_and_reserved.

So, your original suggestion that "every transformation applied to 
$uri ... is automatically applied to 
$uri_encoded_percent_and_reserved as well" is no longer relevant, 
correct?

Considering "match-source=uri", how do you expect the resulting 
$uri_encoded_percent_and_reserved to be set?  From the description 
it looks like it is expected to remain unchanged, though for the 
above example this will result in $uri being "/bar/" and 
$uri_encoded_percent_and_reserved being "/foo%2fbar/", which looks 
certainly wrong.  Is it really the intention?

The same question applies to 
"match-source=uri_encoded_percent_and_reserved".

> > Consider the following configuration:
> >
> > location /foo%2fbar/ match-source=uri-encoded-percent-and-reserved {
> > ...
> > }
> >
> > location /foo/bar/ match-source=uri {
> > ...
> > }
> >
> > The question is: which location is expected to be matched for the
> > request "GET /foo%2fbar/"?
> 
> Both blocks match their respective variables. Since the first block has the
> longest matching prefix, I expect it will be selected.

That's not how prefix locations work: they are matched based on 
the longest prefix, and order of locations does not matter (and 
not even preserved/known during matching, since matching uses a 
prefix tree).  So your suggestion implies that ordered matching 
should be introduced for prefix locations, correct?

> > Which location is expected to be matched for the request "GET
> > /foo%2Fbar/" (note that it is exactly equivalent to "GET
> > /foo%2fbar/").
> 
> Only the second block matches its variable, so I expect it will be selected.
> 
> Although paths are generally case sensitive, a percent-encoded character is not
> supposed to be, so this behaviour is unfortunate. One possibility is to
> automatically use case-insensitive matching for any part of a location prefix
> using "match-source=uri-encoded-percent-and-reserved" that is a percent
> encoded "%" or reserved character.
> 
> Alternatively this behaviour could be documented with an instruction to use
> regular expressions instead.

You mean, only allow 
"match-source=uri-encoded-percent-and-reserved" for locations 
given by regular expressions, correct?

> > Assuming static handling in the locations, what happens with the
> > request "GET /foo%2fbar/..%2fbazz"?
> 
> The first block would be used. However, $uri would be used for static
> handling, so the path "/foo/bazz" would be looked up on the filesystem, not
> "/foo%2fbar/..%2fbazz".

So it can be easily used to bypass security restrictions, such as 
in:

location /foo%2fbar/ match-source=uri-encoded-percent-and-reserved {
}

location /admin/ {
    allow 127.0.0.1;
    deny all;
}

Note that the request "GET /foo%2fbar/..%2f..%2fadmin/secret" is 
going to access protected files in <root>/admin/, while it 
shouldn't be allowed to.

> > Note that the behaviour does not seem to be obvious, and it is an
> > open question if it can be clarified to be safe.
> 
> Fair enough. I am certainly happy to continue making changes to my proposal to
> address the specific concerns you raise. However, are you saying that you have
> broader overall concerns about safety, complexity, etc. that make a patch
> implementing the proposal unlikely to be accepted, even if all specific
> concerns are addressed?

I've just asked some question on how your proposal is expected to 
work - and summarized that the behaviour suggested is certainly 
not obvious, since there are lots of questions on how it is 
expected to behave in various edge cases.

But I certainly agree that there are issues with safety and 
complexity in the proposal.  And I don't think that addressing any 
specific concerns will help here, since addressing them seem to 
result in increased complexity and more concerns.  It might be a 
good idea to start from scratch instead of trying to fix the 
proposal.

Just in case, below is the summary of what I think about the 
topic.

Short version:

All solutions I'm aware of suck.  Don't use encoded slashes in 
URIs, it hurts.

Long version:

First of all, forget about "reserved characters".  The only 
character that really matters is slash ("/"), as this is the only 
character which is indeed reserved in URI path nginx works with.

There are 3 basic approaches to handling encoded slashes in URIs, 
mostly outlined in Apache's AllowEncodedSlashes directive (see 
https://httpd.apache.org/docs/2.4/mod/core.html#allowencodedslashes
for details):

1. Reject them.  This what Apache does by default (though with 
questionable error code).

2. Decode them and assume equivalent to non-encoded slashes.  This 
is what nginx does.  Apache does something similar with 
"AllowEncodedSlashes On", but given the note in the directive 
description it looks like it does not implement the "assume 
equivalent" part.

3. Do not decode them and expect encoded slashes are corrupted if 
URI is re-encoded.  This also implies that if URI is not 
re-encoded but proxied as is, restrictions like in "location 
/admin/ { deny all; }" can be bypassed if slashes are decoded by 
the backend server.

All of the approaches have their pros and cons, with the reject 
one being the safest, while decode and do not decode resulting in 
different forms of corruptions, and not decoding resulting in 
potential security issues on proxying.

I don't think that trying to combine different approaches in the 
normal location matching is going to work.  Rather, this will 
result in unmanageable complexity and will be unsafe, as in the 
above proposal.

In nginx, the only implemented approach is "decode".  It does, 
however, also provides the original request URI in the 
$request_uri variable, so the "reject" approach can be trivially 
implemented in the configuration:

    if ($request_uri ~* "%2f") {
        return 400;
    }

Similarly, as already mentioned in #2225, one can do something 
like this:

    if ($request_uri ~ "^/api/objects/[^/]+/subobjects(/.*)?$") {
        ...
    }

Further, nginx makes it possible to proxy the request without any URI 
modifications, as in "proxy_pass http://upstream;" (note no URI 
component after the upstream server name), so requests with 
encoded slashes can be inspected and proxied without corruption. 

This might be already enough for most, if not all, tasks involving 
encoded slashes: they can be handled without corruption if things 
are properly configured.  Given the above options, this is more 
than usually available.

Note that $request_uri matching doesn't imply various 
normalizations nginx usually does on URI before matching 
locations, such as decoding encoded characters, so something like 
"GET /%61pi/objects/..." won't be matched.  This can be improved 
by providing an additional variable, or by a smart enough 
urldecode() function (see ticket #52), or by an arbitrary 
normalization written in an embedded language, such as Perl or 
njs.  Or it might not worth the effort though, and just rejecting 
such non-normalized requests would be a good enough solution for 
most use cases.

Similarly, it might be possible to explicitly implement the "do 
not decode" approach, with something like "decode_slashes off;" 
(similarly to "merge_slashes off;").  I tend to think that it is 
going to be just another "never use it, it is not secure" 
directive, much like "merge_slashes", and the mere existence of 
this directive is not going to help anyone.  (Also, it might be 
actually a good idea to remove "merge_slashes" instead.)

Bonus game:

The dot (".") character is not reserved in URIs, yet it has 
special meaning when used in "/./" and "/../" constructs.  And 
escaping dots won't help: since "." is not reserved, "/%2e/" is 
exactly equivalent to "/./", and it is in turn equivalent to "/".  
As such, APIs designed with arbitrary object names in URIs and 
assuming escaping is enough simply won't be able to work for these 
names.

Hope this helps.

-- 
Maxim Dounin
http://mdounin.ru/