Environment: Clang compiler on Termux, on Samsung’s One UI 7, on a Samsung Galaxy Tab A9+
The title is an assumption on its own, so feel free to correct it/me!
I was experimenting with sanitizing user input, read into a character array with fgets. Specifically, I was trying to have a for loop remove (skip) certain input. Here is the code:
for (n = strlen(input) - 1; n >= 0; n--) {
    if ((input[n] >= 0x30 && input[n] <= 0x39) || input[n] == ' ' || input[n] == '\t') {
        input[n] = 0x18;
    }
}
While the program does behave as I want it to, I don’t understand why it seemingly understands by default that the various hex codes refer to the character encoding as per the ASCII table. I ran cat on my tablet’s filesystem encoding file at /sys/fs/f2fs/dm-44/encoding, which yielded UTF-8. If I understand correctly, the first 128 code points of Unicode are the same as ASCII’s. But according to this article on Wikipedia, there are no hexadecimal references in Unicode, only octal and decimal.
If the underlying filesystem uses UTF-8, and Unicode code points are not referred to by hex, how then does my compiler (Clang) understand what ASCII code points I’m referring to?
Is there some conversion going on under the hood that I am not aware of? I did find a libxml2/libxml/encoding.h, which contains comments about some conversion to and from UTF-8. Is this it? I can’t make heads or tails of it because of my limited C knowledge…


You are inspecting a byte, which contains 0x39. That it corresponds to “9” doesn’t matter to the compiler at all. Even on a system where “9” did not map to 0x39, the loop would still replace those byte values.
Thanks! I’ll have to noodle this around a bit because it’s hard for me to understand. xD
I dunno if this helps, but this screenshot shows the memory view of a program with a string. The hex representation in the middle is what is actually stored in memory. Each pair is one byte/char, and that is what your input[n] <= 0x39 is comparing against.