<div dir="auto">Thanks reviewing!</div><div dir="auto"><br></div><div dir="auto">I agree with your early return strategy and I would reconsider that condition below.</div><div dir="auto"><br></div><div dir="auto"><div dir="auto"># HG changeset patch</div><div dir="auto"># User Yugo Horie <<a href="mailto:u5.horie@gmail.com">u5.horie@gmail.com</a>></div><div dir="auto"># Date 1677107390 -32400</div><div dir="auto">#      Thu Feb 23 08:09:50 2023 +0900</div><div dir="auto"># Node ID a3ca45d39fcfd32ca92a6bd25ec18b6359b90f1a</div><div dir="auto"># Parent  f4653576ffcd286bed7229e18ee30ec3c713b4de</div><div dir="auto">Core: restrict the rule of utf-8 decode.</div><div dir="auto"><br></div><div dir="auto">The first byte being above 0xf8 which is referred to 5byte</div><div dir="auto">over length older utf-8 becomes invalid.</div><div dir="auto">Even the range of the first byte from 0xf5 to</div><div dir="auto">0xf7 is valid in the term of the codepoint decoding.</div><div dir="auto">See <a href="https://datatracker.ietf.org/doc/html/rfc3629#section-4">https://datatracker.ietf.org/doc/html/rfc3629#section-4</a>.</div><div dir="auto"><br></div><div dir="auto">diff -r f4653576ffcd -r a3ca45d39fcf src/core/ngx_string.c</div><div dir="auto">--- a/src/core/ngx_string.c     Thu Feb 23 07:56:44 2023 +0900</div><div dir="auto">+++ b/src/core/ngx_string.c     Thu Feb 23 08:09:50 2023 +0900</div><div dir="auto">@@ -1363,8 +1363,12 @@</div><div dir="auto">     uint32_t  u, i, valid;</div><div dir="auto"> </div><div dir="auto">     u = **p;</div><div dir="auto">-</div><div dir="auto">-    if (u >= 0xf0) {</div><div dir="auto">+    if (u >= 0xf8) {</div><div dir="auto">+</div><div dir="auto">+        (*p)++;</div><div dir="auto">+        return 0xffffffff;</div><div dir="auto">+</div><div dir="auto">+    } else if (u >= 0xf0) {</div><div dir="auto"> </div><div dir="auto">         u &= 0x07;</div><div dir="auto">         valid = 0xffff;</div></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Feb 23, 2023 at 6:41 Maxim Dounin <<a href="mailto:mdounin@mdounin.ru">mdounin@mdounin.ru</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">Hello!<br>

<br>

On Thu, Feb 23, 2023 at 12:10:27AM +0900, u5h wrote:<br>

<br>

> # HG changeset patch<br>

> # User Yugo Horie <<a href="mailto:u5.horie@gmail.com" target="_blank">u5.horie@gmail.com</a>><br>

> # Date 1677077775 -32400<br>

> #      Wed Feb 22 23:56:15 2023 +0900<br>

> # Node ID 1a9487706c6af90baf2ed770db29f689c3850721<br>

> # Parent  cffaf3f2eec8fd33605c2a37814f5ffc30371989<br>

> core: return error when the first byte is above 0xf5 in utf-8<br>

> <br>

> * see <a href="https://datatracker.ietf.org/doc/html/rfc3629#section-4" rel="noreferrer" target="_blank">https://datatracker.ietf.org/doc/html/rfc3629#section-4</a><br>

> <br>

<br>

Thanks for catching, this needs to be addressed.<br>

<br>

The existing code accepts invalid UTF-8 sequences with the first <br>

byte being 11111xxx (>= 0xf8), and interprets them as if the first <br>

byte was 11110xxx (as a 4-byte sequence).  This is going to <br>

misinterpret future UTF-8 if it will be ever expanded to 5-byte <br>

sequences (unlikely though), and might cause other problems <br>

(including security ones, as the RFC rightfully warns, though also <br>

unlikely in this particular case).<br>

<br>

A more verbose commit log might be beneficial here, as well as <br>

the one matching nginx style for commit logs.  See <br>

<a href="http://nginx.org/en/docs/contributing_changes.html" rel="noreferrer" target="_blank">http://nginx.org/en/docs/contributing_changes.html</a> for some basic <br>

tips.<br>

<br>

> diff -r cffaf3f2eec8 -r 1a9487706c6a src/core/ngx_string.c<br>

> --- a/src/core/ngx_string.c     Thu Feb 02 23:38:48 2023 +0300<br>

> +++ b/src/core/ngx_string.c     Wed Feb 22 23:56:15 2023 +0900<br>

> @@ -1364,7 +1364,7 @@<br>

> <br>

>      u = **p;<br>

> <br>

> -    if (u >= 0xf0) {<br>

> +    if (u < 0xf5 && u >= 0xf0) {<br>

> <br>

>          u &= 0x07;<br>

>          valid = 0xffff;<br>

<br>

The suggested change doesn't look correct to me.<br>

<br>

In particular, the "u == 0xf5" (as well as 0xf6 and 0xf7) case is <br>

valid in terms of the codepoint decoding, though resulting value <br>

will be in the invalid range as per the comment above the <br>

function.  So it does not really need any special handling, except <br>

may be to simplify things.  The interesting part here is "u >= 0xf8" <br>

(which is complimentary to "u &= 0x07" being used in "if").<br>

<br>

Further, as the condition is written, "u >= 0xf5" is now excluded <br>

from the first "if", but will be matched by the "if (u >= 0xe0)" <br>

below, and misinterpreted as a 3-byte sequence.  This looks worse <br>

than the existing behaviour.<br>

<br>

I would rather consider something like:<br>

<br>

diff --git a/src/core/ngx_string.c b/src/core/ngx_string.c<br>

--- a/src/core/ngx_string.c<br>

+++ b/src/core/ngx_string.c<br>

@@ -1364,7 +1364,12 @@ ngx_utf8_decode(u_char **p, size_t n)<br>

<br>

     u = **p;<br>

<br>

-    if (u >= 0xf0) {<br>

+    if (u >= 0xf8) {<br>

+<br>

+        (*p)++;<br>

+        return 0xffffffff;<br>

+<br>

+    } else if (u >= 0xf0) {<br>

<br>

         u &= 0x07;<br>

         valid = 0xffff;<br>

<br>

<br>

-- <br>

Maxim Dounin<br>

<a href="http://mdounin.ru/" rel="noreferrer" target="_blank">http://mdounin.ru/</a><br>

_______________________________________________<br>

nginx-devel mailing list<br>

<a href="mailto:nginx-devel@nginx.org" target="_blank">nginx-devel@nginx.org</a><br>

<a href="https://mailman.nginx.org/mailman/listinfo/nginx-devel" rel="noreferrer" target="_blank">https://mailman.nginx.org/mailman/listinfo/nginx-devel</a><br>

</blockquote></div></div>