[PATCH] Core: return error when the first byte is above 0xf5 in utf-8

Wed Mar 1 23:51:11 UTC 2023

Hello!

On Thu, Feb 23, 2023 at 09:24:52AM +0900, u5h wrote:

> Thanks for reviewing!
> 
> I agree with your early return strategy, and I have reconsidered that
> condition as shown below.
> 
> # HG changeset patch
> # User Yugo Horie <u5.horie at gmail.com>
> # Date 1677107390 -32400
> #      Thu Feb 23 08:09:50 2023 +0900
> # Node ID a3ca45d39fcfd32ca92a6bd25ec18b6359b90f1a
> # Parent  f4653576ffcd286bed7229e18ee30ec3c713b4de
> Core: restrict the rule of utf-8 decode.
> 
> A first byte of 0xf8 or above, which introduced sequences of 5 or
> more bytes in older UTF-8 definitions, becomes invalid.
> First bytes in the range 0xf5 to 0xf7 remain valid in terms of
> code point decoding.
> See https://datatracker.ietf.org/doc/html/rfc3629#section-4.
> 
> diff -r f4653576ffcd -r a3ca45d39fcf src/core/ngx_string.c
> --- a/src/core/ngx_string.c     Thu Feb 23 07:56:44 2023 +0900
> +++ b/src/core/ngx_string.c     Thu Feb 23 08:09:50 2023 +0900
> @@ -1363,8 +1363,12 @@
>      uint32_t  u, i, valid;
> 
>      u = **p;
> -
> -    if (u >= 0xf0) {
> +    if (u >= 0xf8) {
> +
> +        (*p)++;
> +        return 0xffffffff;
> +
> +    } else if (u >= 0xf0) {
> 
>          u &= 0x07;
>          valid = 0xffff;

Slightly adjusted the commit log to better explain the issue (and 
restored the accidentally removed empty line).  Please take a look 
if it seems good enough:

# HG changeset patch
# User Yugo Horie <u5.horie at gmail.com>
# Date 1677107390 -32400
#      Thu Feb 23 08:09:50 2023 +0900
# Node ID a10210a45c8b6e6bb75e98b2fd64a80c184ae247
# Parent  2acb00b9b5fff8a97523b659af4377fc605abe6e
Core: stricter UTF-8 handling in ngx_utf8_decode().

A UTF-8 octet sequence cannot start with a 11111xxx byte (0xf8 and above),
see https://datatracker.ietf.org/doc/html/rfc3629#section-3.  Previously,
such bytes were accepted by ngx_utf8_decode() and misinterpreted as 11110xxx
bytes (as in a 4-byte sequence).  While unlikely, this can potentially cause
issues.

The fix is to explicitly reject such bytes in ngx_utf8_decode().

diff --git a/src/core/ngx_string.c b/src/core/ngx_string.c
--- a/src/core/ngx_string.c
+++ b/src/core/ngx_string.c
@@ -1364,7 +1364,12 @@ ngx_utf8_decode(u_char **p, size_t n)
 
     u = **p;
 
-    if (u >= 0xf0) {
+    if (u >= 0xf8) {
+
+        (*p)++;
+        return 0xffffffff;
+
+    } else if (u >= 0xf0) {
 
         u &= 0x07;
         valid = 0xffff;

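To illustrate the misinterpretation described in the commit log, here is a
minimal standalone sketch.  The helper lead_byte_payload() and its "strict"
flag are hypothetical names made up for this example; this is not the nginx
code, only the lead-byte masking step in isolation, and it assumes the caller
has already established that the byte is 0xf0 or above.

```c
#include <stdint.h>

/*
 * Hypothetical helper for illustration only; it is not part of nginx and
 * does not reproduce ngx_utf8_decode().  It returns the payload bits that
 * a decoder would take from the lead byte of a 4-byte sequence, or
 * 0xffffffff when "strict" is set and the byte has the 11111xxx form.
 * The caller is assumed to have checked that b >= 0xf0.
 */
static uint32_t
lead_byte_payload(uint8_t b, int strict)
{
    if (strict && b >= 0xf8) {
        /* 11111xxx: not a valid first byte of a UTF-8 sequence */
        return 0xffffffff;
    }

    /* 11110xxx: keep the low three bits, as "u &= 0x07" does */
    return b & 0x07;
}
```

With "strict" unset, 0xf8 and 0xf0 both yield a payload of 0, so the two
bytes become indistinguishable after masking; with "strict" set, 0xf8 is
rejected up front, which is the behaviour the patch introduces.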

-- 
Maxim Dounin
http://mdounin.ru/