u_char vs char (was: [PATCH] Removed the unsafe ngx_memcmp() wrapper for memcmp(3))

Alejandro Colomar alx.manpages at gmail.com
Tue Nov 8 11:15:51 UTC 2022


Hello!

On 11/8/22 10:50, Maxim Dounin wrote:
>> Even if it's a bit off-topic, I'm very curious about the reason for using
>> u_char.  It definitely requires a lot of extra work compared to 'char *':
>> casts, reduced type safety, and reviewing that the code still works when
>> working around or disabling the compiler warnings.  I'm guessing it was
>> also some workaround for broken old implementations and has just continued
>> like that for consistency, but I am curious whether there are other, better
>> reasons.  Certainly, ASCII characters behave well (at least nowadays)
>> independently of the signedness of char, and usually one doesn't do
>> arithmetic with characters in strings.
> 
> Using signed chars for strings simply does not work as long as you
> consider 8-bit strings.  It results in wrong sorting unless you
> take care to compare characters as unsigned, requires careful
> handling of all range comparisons such as "ch <= 0x20", does not
> permit things like "ch < 0x80" or "c >= 0xc0", and makes it
> impossible to use table lookups such as "basis64[s[0]]" (all
> snippets are from nginx code).
> 
> The fact that signedness of "char" is not known adds even more
> fun: you can't really do anything without casting it to either
> unsigned char or signed char.
> 
> In general, using "char" for strings is a well-known source of
> trouble, at least in the Cyrillic world.  Writing code that
> works with arbitrary chars is tricky and error-prone as long as
> you are doing anything more complex than just calling libc
> functions.  On the other hand, casts for external functions can be
> easily abstracted in most cases, and are always trivial.

Hmm, yeah, it makes sense.  The libc design around char instead of u_char is
broken, and the requirement that libc macros need to be called with a
cast (e.g., toupper(3)) shows that.
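
To make the cast requirement concrete, here is a minimal sketch.  The helper
name upcase_bytes() is made up for illustration and is neither nginx nor libc
code.  It relies only on the documented rule that toupper(3) takes an int
whose value must be representable as unsigned char or be EOF:

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from nginx or libc): upper-case a buffer in
 * place.  The cast to unsigned char matters: where plain char is signed,
 * a byte such as 0xE9 would otherwise reach toupper() as a negative int,
 * which is undefined behavior.
 */
static void
upcase_bytes(char *s, size_t n)
{
    size_t  i;

    for (i = 0; i < n; i++) {
        s[i] = (char) toupper((unsigned char) s[i]);
    }
}
```

With a buffer declared as u_char the inner cast goes away, and only a
conversion at the libc boundary remains.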

If nginx does things with chars other than calling libc, it makes a lot of sense 
to also use u_char.
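
A rough sketch of the kinds of operations mentioned in the quoted text
(sorting, range checks, table lookups), written with explicit signed char and
unsigned char so the result does not depend on the platform's plain char.  All
function names are hypothetical, and the signed cases assume the usual
two's-complement conversion of values above 127:

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; none of these are nginx code. */

/* Three-way compare of bytes held as signed char: 0xE9 becomes negative,
 * so it sorts before every ASCII character. */
static int
cmp_signed(signed char a, signed char b)
{
    return (a > b) - (a < b);
}

/* The same compare on unsigned bytes gives plain byte order. */
static int
cmp_unsigned(unsigned char a, unsigned char b)
{
    return (a > b) - (a < b);
}

/* Range check in the style of "c >= 0xc0" on a signed byte. */
static int
is_lead_signed(signed char ch)
{
    int  v = ch;          /* promotion keeps the negative value */

    return v >= 0xc0;     /* never true: v is at most 127 */
}

/* The same range check on an unsigned byte works as written. */
static int
is_lead_unsigned(unsigned char ch)
{
    return ch >= 0xc0;
}

/* Table lookup: an unsigned byte is always a valid index into a
 * 256-entry table; a negative signed byte would index before it. */
static unsigned char
lookup(const unsigned char table[256], unsigned char ch)
{
    return table[ch];
}
```

With the byte 0xE9, cmp_unsigned() orders it after 'z' while cmp_signed()
orders it before, and with 0xC3 only is_lead_unsigned() reports a match.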

Thanks for the rationale!  It certainly helps to understand why it was done that 
way.

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>