<div dir="ltr"><div class="gmail_quote">I was recently wondering whether I should filter URLs by character, allowing only what is standard in applications.</div><div class="gmail_quote"><br></div>
<div class="gmail_quote">That means letters, numbers, and a few other characters [.-_/\]. We know the set of characters supported in URLs and domains is <a href="http://perishablepress.com/stop-using-unsafe-characters-in-urls/">really just a subset of ASCII</a>.</div><div class="gmail_quote"><br></div>
<div class="gmail_quote">However, I'm not totally sure what nginx does when I pass "<span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.7999992370605px;white-space:pre">µ</span>" to it.</div><div class="gmail_quote"><br></div>
<div class="gmail_quote">I came up with a simple regular expression to match anything that isn't one of those characters:</div><div class="gmail_quote"><br></div>
<div class="gmail_quote"><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.7999992370605px;white-space:pre">location ~* </span><span class="" style="color:rgb(223,80,0);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.7999992370605px;white-space:pre"><span class="">"</span>(*UTF8)([^\p{L}\p{N}/\.\-\%\\\]+)<span class="">"</span></span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.7999992370605px;white-space:pre"> {</span><br></div>
<div class="gmail_quote"><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.7999992370605px;white-space:pre">if ($uri ~* </span><span class="" style="color:rgb(223,80,0);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.7999992370605px;white-space:pre"><span class="">"</span>(*UTF8)([^\p{L}\p{N}/\.\-\%\\\]+)<span class="">"</span></span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.7999992370605px;white-space:pre"> ) {</span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.7999992370605px;white-space:pre"><br></span></div><div class="gmail_quote"><br></div>
<div class="gmail_quote">However, I'm wondering whether I actually need the UTF-8 matching, since clients should default to URL-encoding (%20) or hex-encoding (\x23) the bytes, and the actual transfer should be binary anyway.</div><div class="gmail_quote"><br></div>
<div class="gmail_quote">Here is an example test where I piped almost all 65,000 Unicode code points to nginx via curl:</div><div class="gmail_quote"><br></div>
<div class="gmail_quote"><a href="https://gist.github.com/Xeoncross/acca3f09c5aeddac8c9f">https://gist.github.com/Xeoncross/acca3f09c5aeddac8c9f</a><br></div><div class="gmail_quote"><br></div>
<div class="gmail_quote">For example: $ curl -v <a href="http://localhost/%E4%B8%8E" target="_blank">http://localhost/与</a></div><div class="gmail_quote"><br></div>
<div class="gmail_quote">Basically, is there any point in watching URLs for non-standard sequences as a sign of possible attacks?</div><div class="gmail_quote"><br></div>
<div class="gmail_quote">( FYI: I posted more details that led to this question here:<br>
<a href="http://stackoverflow.com/questions/28055909/does-nginx-support-raw-unicode-in-paths" target="_blank">http://stackoverflow.com/questions/28055909/does-nginx-support-raw-unicode-in-paths</a> )<br>
</div><br></div>
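<div dir="ltr"><div class="gmail_quote">To make the encoding part of the question concrete, here is a rough Python sketch of what a client would put on the wire for the two characters mentioned above, assuming it percent-encodes the UTF-8 bytes of the path. The helper names percent_encode_path and has_unexpected_chars are made up for illustration, and str.isalnum() is only an approximation of \p{L}\p{N}, not the PCRE classes themselves.</div></div>

```python
from urllib.parse import quote, unquote

# Hypothetical helper: percent-encode a path as UTF-8 bytes, keeping "/" as-is.
def percent_encode_path(path):
    return quote(path, safe="/")

# Hypothetical helper: rough stand-in for the regex in the post.
# Letters and numbers are approximated with str.isalnum(); the extra
# allowed characters are "/", ".", "-", "%" and the backslash.
ALLOWED_PUNCTUATION = set("/.-%\\")

def has_unexpected_chars(decoded_path):
    return any(
        not (ch.isalnum() or ch in ALLOWED_PUNCTUATION)
        for ch in decoded_path
    )

if __name__ == "__main__":
    micro = "/\u00b5"  # the micro sign from the post
    cjk = "/\u4e0e"    # the character used in the curl example
    print(percent_encode_path(micro))
    print(percent_encode_path(cjk))
    # Decoding the wire form gives back a letter, so the check passes.
    print(has_unexpected_chars(unquote(percent_encode_path(cjk))))
    # A space is outside the allowed set, so this path is flagged.
    print(has_unexpected_chars("/a b"))
```

<div dir="ltr"><div class="gmail_quote">If nginx matches against the decoded path, which is how its documentation describes location matching, a check like this sees the decoded character rather than the %XX form, and that is where the question of UTF-8 matching comes in.</div></div>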