[PATCH] Core: return error when the first byte is above 0xf5 in utf-8

u5h u5.horie at gmail.com
Thu Feb 23 00:24:52 UTC 2023


Thanks reviewing!

I agree with your early return strategy and I would reconsider that
condition below.

# HG changeset patch
# User Yugo Horie <u5.horie at gmail.com>
# Date 1677107390 -32400
#      Thu Feb 23 08:09:50 2023 +0900
# Node ID a3ca45d39fcfd32ca92a6bd25ec18b6359b90f1a
# Parent  f4653576ffcd286bed7229e18ee30ec3c713b4de
Core: restrict the rule of utf-8 decode.

The first byte being above 0xf8 which is referred to 5byte
over length older utf-8 becomes invalid.
Even the range of the first byte from 0xf5 to
0xf7 is valid in the term of the codepoint decoding.
See https://datatracker.ietf.org/doc/html/rfc3629#section-4.

diff -r f4653576ffcd -r a3ca45d39fcf src/core/ngx_string.c
--- a/src/core/ngx_string.c     Thu Feb 23 07:56:44 2023 +0900
+++ b/src/core/ngx_string.c     Thu Feb 23 08:09:50 2023 +0900
@@ -1363,8 +1363,12 @@
     uint32_t  u, i, valid;

     u = **p;
-
-    if (u >= 0xf0) {
+    if (u >= 0xf8) {
+
+        (*p)++;
+        return 0xffffffff;
+
+    } else if (u >= 0xf0) {

         u &= 0x07;
         valid = 0xffff;

On Thu, Feb 23, 2023 at 6:41 Maxim Dounin <mdounin at mdounin.ru> wrote:

> Hello!
>
> On Thu, Feb 23, 2023 at 12:10:27AM +0900, u5h wrote:
>
> > # HG changeset patch
> > # User Yugo Horie <u5.horie at gmail.com>
> > # Date 1677077775 -32400
> > #      Wed Feb 22 23:56:15 2023 +0900
> > # Node ID 1a9487706c6af90baf2ed770db29f689c3850721
> > # Parent  cffaf3f2eec8fd33605c2a37814f5ffc30371989
> > core: return error when the first byte is above 0xf5 in utf-8
> >
> > * see https://datatracker.ietf.org/doc/html/rfc3629#section-4
> >
>
> Thanks for catching, this needs to be addressed.
>
> The existing code accepts invalid UTF-8 sequences with the first
> byte being 11111xxx (>= 0xf8), and interprets them as if the first
> byte was 11110xxx (as a 4-byte sequence).  This is going to
> misinterpret future UTF-8 if it will be ever expanded to 5-byte
> sequences (unlikely though), and might cause other problems
> (including security ones, as the RFC rightfully warns, though also
> unlikely in this particular case).
>
> A more verbose commit log might be beneficial here, as well as
> the one matching nginx style for commit logs.  See
> http://nginx.org/en/docs/contributing_changes.html for some basic
> tips.
>
> > diff -r cffaf3f2eec8 -r 1a9487706c6a src/core/ngx_string.c
> > --- a/src/core/ngx_string.c     Thu Feb 02 23:38:48 2023 +0300
> > +++ b/src/core/ngx_string.c     Wed Feb 22 23:56:15 2023 +0900
> > @@ -1364,7 +1364,7 @@
> >
> >      u = **p;
> >
> > -    if (u >= 0xf0) {
> > +    if (u < 0xf5 && u >= 0xf0) {
> >
> >          u &= 0x07;
> >          valid = 0xffff;
>
> The suggested change doesn't look correct to me.
>
> In particular, the "u == 0xf5" (as well as 0xf6 and 0xf7) case is
> valid in terms of the codepoint decoding, though resulting value
> will be in the invalid range as per the comment above the
> function.  So it does not really need any special handling, except
> may be to simplify things.  The interesting part here is "u >= 0xf8"
> (which is complimentary to "u &= 0x07" being used in "if").
>
> Further, as the condition is written, "u >= 0xf5" is now excluded
> from the first "if", but will be matched by the "if (u >= 0xe0)"
> below, and misinterpreted as a 3-byte sequence.  This looks worse
> than the existing behaviour.
>
> I would rather consider something like:
>
> diff --git a/src/core/ngx_string.c b/src/core/ngx_string.c
> --- a/src/core/ngx_string.c
> +++ b/src/core/ngx_string.c
> @@ -1364,7 +1364,12 @@ ngx_utf8_decode(u_char **p, size_t n)
>
>      u = **p;
>
> -    if (u >= 0xf0) {
> +    if (u >= 0xf8) {
> +
> +        (*p)++;
> +        return 0xffffffff;
> +
> +    } else if (u >= 0xf0) {
>
>          u &= 0x07;
>          valid = 0xffff;
>
>
> --
> Maxim Dounin
> http://mdounin.ru/
> _______________________________________________
> nginx-devel mailing list
> nginx-devel at nginx.org
> https://mailman.nginx.org/mailman/listinfo/nginx-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.nginx.org/pipermail/nginx-devel/attachments/20230223/3f1c34b4/attachment.htm>


More information about the nginx-devel mailing list