[PATCH] Core: return error when the first byte is above 0xf5 in utf-8
u5h
u5.horie at gmail.com
Thu Feb 23 00:24:52 UTC 2023
Thanks reviewing!
I agree with your early return strategy and I would reconsider that
condition below.
# HG changeset patch
# User Yugo Horie <u5.horie at gmail.com>
# Date 1677107390 -32400
# Thu Feb 23 08:09:50 2023 +0900
# Node ID a3ca45d39fcfd32ca92a6bd25ec18b6359b90f1a
# Parent f4653576ffcd286bed7229e18ee30ec3c713b4de
Core: restrict the rule of utf-8 decode.
The first byte being above 0xf8 which is referred to 5byte
over length older utf-8 becomes invalid.
Even the range of the first byte from 0xf5 to
0xf7 is valid in the term of the codepoint decoding.
See https://datatracker.ietf.org/doc/html/rfc3629#section-4.
diff -r f4653576ffcd -r a3ca45d39fcf src/core/ngx_string.c
--- a/src/core/ngx_string.c Thu Feb 23 07:56:44 2023 +0900
+++ b/src/core/ngx_string.c Thu Feb 23 08:09:50 2023 +0900
@@ -1363,8 +1363,12 @@
uint32_t u, i, valid;
u = **p;
-
- if (u >= 0xf0) {
+ if (u >= 0xf8) {
+
+ (*p)++;
+ return 0xffffffff;
+
+ } else if (u >= 0xf0) {
u &= 0x07;
valid = 0xffff;
On Thu, Feb 23, 2023 at 6:41 Maxim Dounin <mdounin at mdounin.ru> wrote:
> Hello!
>
> On Thu, Feb 23, 2023 at 12:10:27AM +0900, u5h wrote:
>
> > # HG changeset patch
> > # User Yugo Horie <u5.horie at gmail.com>
> > # Date 1677077775 -32400
> > # Wed Feb 22 23:56:15 2023 +0900
> > # Node ID 1a9487706c6af90baf2ed770db29f689c3850721
> > # Parent cffaf3f2eec8fd33605c2a37814f5ffc30371989
> > core: return error when the first byte is above 0xf5 in utf-8
> >
> > * see https://datatracker.ietf.org/doc/html/rfc3629#section-4
> >
>
> Thanks for catching, this needs to be addressed.
>
> The existing code accepts invalid UTF-8 sequences with the first
> byte being 11111xxx (>= 0xf8), and interprets them as if the first
> byte was 11110xxx (as a 4-byte sequence). This is going to
> misinterpret future UTF-8 if it will be ever expanded to 5-byte
> sequences (unlikely though), and might cause other problems
> (including security ones, as the RFC rightfully warns, though also
> unlikely in this particular case).
>
> A more verbose commit log might be beneficial here, as well as
> the one matching nginx style for commit logs. See
> http://nginx.org/en/docs/contributing_changes.html for some basic
> tips.
>
> > diff -r cffaf3f2eec8 -r 1a9487706c6a src/core/ngx_string.c
> > --- a/src/core/ngx_string.c Thu Feb 02 23:38:48 2023 +0300
> > +++ b/src/core/ngx_string.c Wed Feb 22 23:56:15 2023 +0900
> > @@ -1364,7 +1364,7 @@
> >
> > u = **p;
> >
> > - if (u >= 0xf0) {
> > + if (u < 0xf5 && u >= 0xf0) {
> >
> > u &= 0x07;
> > valid = 0xffff;
>
> The suggested change doesn't look correct to me.
>
> In particular, the "u == 0xf5" (as well as 0xf6 and 0xf7) case is
> valid in terms of the codepoint decoding, though resulting value
> will be in the invalid range as per the comment above the
> function. So it does not really need any special handling, except
> may be to simplify things. The interesting part here is "u >= 0xf8"
> (which is complimentary to "u &= 0x07" being used in "if").
>
> Further, as the condition is written, "u >= 0xf5" is now excluded
> from the first "if", but will be matched by the "if (u >= 0xe0)"
> below, and misinterpreted as a 3-byte sequence. This looks worse
> than the existing behaviour.
>
> I would rather consider something like:
>
> diff --git a/src/core/ngx_string.c b/src/core/ngx_string.c
> --- a/src/core/ngx_string.c
> +++ b/src/core/ngx_string.c
> @@ -1364,7 +1364,12 @@ ngx_utf8_decode(u_char **p, size_t n)
>
> u = **p;
>
> - if (u >= 0xf0) {
> + if (u >= 0xf8) {
> +
> + (*p)++;
> + return 0xffffffff;
> +
> + } else if (u >= 0xf0) {
>
> u &= 0x07;
> valid = 0xffff;
>
>
> --
> Maxim Dounin
> http://mdounin.ru/
> _______________________________________________
> nginx-devel mailing list
> nginx-devel at nginx.org
> https://mailman.nginx.org/mailman/listinfo/nginx-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.nginx.org/pipermail/nginx-devel/attachments/20230223/3f1c34b4/attachment.htm>
More information about the nginx-devel
mailing list