[PATCH] Core: return error when the first byte is above 0xf5 in utf-8

Wed Feb 22 21:40:54 UTC 2023

Hello!

On Thu, Feb 23, 2023 at 12:10:27AM +0900, u5h wrote:

> # HG changeset patch
> # User Yugo Horie <u5.horie at gmail.com>
> # Date 1677077775 -32400
> #      Wed Feb 22 23:56:15 2023 +0900
> # Node ID 1a9487706c6af90baf2ed770db29f689c3850721
> # Parent  cffaf3f2eec8fd33605c2a37814f5ffc30371989
> core: return error when the first byte is above 0xf5 in utf-8
> 
> * see https://datatracker.ietf.org/doc/html/rfc3629#section-4
> 

Thanks for catching, this needs to be addressed.

The existing code accepts invalid UTF-8 sequences with the first 
byte being 11111xxx (>= 0xf8), and interprets them as if the first 
byte was 11110xxx (as a 4-byte sequence).  This is going to 
misinterpret future UTF-8 if it will be ever expanded to 5-byte 
sequences (unlikely though), and might cause other problems 
(including security ones, as the RFC rightfully warns, though also 
unlikely in this particular case).

A more verbose commit log might be beneficial here, as well as 
the one matching nginx style for commit logs.  See 
http://nginx.org/en/docs/contributing_changes.html for some basic 
tips.

> diff -r cffaf3f2eec8 -r 1a9487706c6a src/core/ngx_string.c
> --- a/src/core/ngx_string.c     Thu Feb 02 23:38:48 2023 +0300
> +++ b/src/core/ngx_string.c     Wed Feb 22 23:56:15 2023 +0900
> @@ -1364,7 +1364,7 @@
> 
>      u = **p;
> 
> -    if (u >= 0xf0) {
> +    if (u < 0xf5 && u >= 0xf0) {
> 
>          u &= 0x07;
>          valid = 0xffff;

The suggested change doesn't look correct to me.

In particular, the "u == 0xf5" (as well as 0xf6 and 0xf7) case is 
valid in terms of the codepoint decoding, though resulting value 
will be in the invalid range as per the comment above the 
function.  So it does not really need any special handling, except 
may be to simplify things.  The interesting part here is "u >= 0xf8" 
(which is complimentary to "u &= 0x07" being used in "if").

Further, as the condition is written, "u >= 0xf5" is now excluded 
from the first "if", but will be matched by the "if (u >= 0xe0)" 
below, and misinterpreted as a 3-byte sequence.  This looks worse 
than the existing behaviour.

I would rather consider something like:

diff --git a/src/core/ngx_string.c b/src/core/ngx_string.c
--- a/src/core/ngx_string.c
+++ b/src/core/ngx_string.c
@@ -1364,7 +1364,12 @@ ngx_utf8_decode(u_char **p, size_t n)
 
     u = **p;
 
-    if (u >= 0xf0) {
+    if (u >= 0xf8) {
+
+        (*p)++;
+        return 0xffffffff;
+
+    } else if (u >= 0xf0) {
 
         u &= 0x07;
         valid = 0xffff;


-- 
Maxim Dounin
http://mdounin.ru/