[PATCH] Core: return error when the first byte is above 0xf5 in utf-8
Maxim Dounin
mdounin at mdounin.ru
Wed Feb 22 21:40:54 UTC 2023
Hello!
On Thu, Feb 23, 2023 at 12:10:27AM +0900, u5h wrote:
> # HG changeset patch
> # User Yugo Horie <u5.horie at gmail.com>
> # Date 1677077775 -32400
> # Wed Feb 22 23:56:15 2023 +0900
> # Node ID 1a9487706c6af90baf2ed770db29f689c3850721
> # Parent cffaf3f2eec8fd33605c2a37814f5ffc30371989
> core: return error when the first byte is above 0xf5 in utf-8
>
> * see https://datatracker.ietf.org/doc/html/rfc3629#section-4
>
Thanks for catching, this needs to be addressed.
The existing code accepts invalid UTF-8 sequences with the first
byte being 11111xxx (>= 0xf8), and interprets them as if the first
byte was 11110xxx (as a 4-byte sequence). This is going to
misinterpret future UTF-8 if it will be ever expanded to 5-byte
sequences (unlikely though), and might cause other problems
(including security ones, as the RFC rightfully warns, though also
unlikely in this particular case).
A more verbose commit log might be beneficial here, as well as
the one matching nginx style for commit logs. See
http://nginx.org/en/docs/contributing_changes.html for some basic
tips.
> diff -r cffaf3f2eec8 -r 1a9487706c6a src/core/ngx_string.c
> --- a/src/core/ngx_string.c Thu Feb 02 23:38:48 2023 +0300
> +++ b/src/core/ngx_string.c Wed Feb 22 23:56:15 2023 +0900
> @@ -1364,7 +1364,7 @@
>
> u = **p;
>
> - if (u >= 0xf0) {
> + if (u < 0xf5 && u >= 0xf0) {
>
> u &= 0x07;
> valid = 0xffff;
The suggested change doesn't look correct to me.
In particular, the "u == 0xf5" (as well as 0xf6 and 0xf7) case is
valid in terms of the codepoint decoding, though resulting value
will be in the invalid range as per the comment above the
function. So it does not really need any special handling, except
may be to simplify things. The interesting part here is "u >= 0xf8"
(which is complimentary to "u &= 0x07" being used in "if").
Further, as the condition is written, "u >= 0xf5" is now excluded
from the first "if", but will be matched by the "if (u >= 0xe0)"
below, and misinterpreted as a 3-byte sequence. This looks worse
than the existing behaviour.
I would rather consider something like:
diff --git a/src/core/ngx_string.c b/src/core/ngx_string.c
--- a/src/core/ngx_string.c
+++ b/src/core/ngx_string.c
@@ -1364,7 +1364,12 @@ ngx_utf8_decode(u_char **p, size_t n)
u = **p;
- if (u >= 0xf0) {
+ if (u >= 0xf8) {
+
+ (*p)++;
+ return 0xffffffff;
+
+ } else if (u >= 0xf0) {
u &= 0x07;
valid = 0xffff;
--
Maxim Dounin
http://mdounin.ru/
More information about the nginx-devel
mailing list