Environment: Clang compiler on Termux, running on Samsung’s One UI 7 on a Samsung Galaxy Tab A9+
The title is an assumption on its own, so feel free to correct it/me!
I was experimenting with sanitizing user input, read to a character array with fgets. Specifically, I was trying to have a for loop remove (skip) certain input. Here is the code:
for (n = strlen(input) - 1; n >= 0; n--) {
    if ((input[n] >= 0x30 && input[n] <= 0x39) || input[n] == ' ' || input[n] == '\t') {
        input[n] = 0x18;
    }
}
While the program does behave as I want it to, I don’t understand why it seemingly by default understands that the various hex codes refer to the character encoding as per the ASCII table. I ran cat on my tablet’s filesystem encoding at /sys/fs/f2fs/dm-44/encoding, which yielded UTF-8. If I understand it correctly, the first 128 code points of Unicode are the same as ASCII’s. But according to this article on Wikipedia, there are no hexadecimal references in Unicode, only octal and decimal.
If the underlying filesystem uses UTF-8, and Unicode code points are not referred to by hex, how then does my compiler (Clang) understand what ASCII code points I’m referring to?
Is there some conversion going on under the hood that I am not aware of? I did find a libxml2/libxml/encoding.h, which contains comments about some conversion to and from UTF-8. Is this it? I can’t make heads or tails of it because of my limited C knowledge…


@akunohana Apologies if I’m missing something, but 0x30, 0x39, and 0x18 are merely hex notation for integers. You could have written 48, 57, and 24 instead for the same effect. Or you (probably?) could have used char literals like '0', '9' (I guess there isn’t one for U+0018).
UTF-8 determines how the characters are encoded, i.e. what sequence of numbers (bytes) they are represented by. So there’s no special understanding of Unicode or hex going on here; you’re comparing numbers (as you should).
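To make the "just numbers" point concrete, here is a minimal sketch, assuming an ASCII-compatible character set (which UTF-8 is). The two helper names are made up for illustration; they are not from the original program. Both express the same test as the loop in the question, once with hex integers and once with character literals.

```c
#include <stdbool.h>

/* Hypothetical helper: the check from the question, written with hex integers.
   Assumes an ASCII-compatible encoding, where '0' is 0x30, ' ' is 0x20
   and '\t' is 0x09. */
bool is_filtered_hex(char c)
{
    return (c >= 0x30 && c <= 0x39) || c == 0x20 || c == 0x09;
}

/* Hypothetical helper: the same check written with character literals.
   The compiler turns each literal into a plain integer, so on an
   ASCII-compatible system both functions behave identically. */
bool is_filtered_literal(char c)
{
    return (c >= '0' && c <= '9') || c == ' ' || c == '\t';
}
```

For U+0018 there is no printable glyph to type, but the escape '\x18' is a valid character literal with the value 24.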
LOL (please excuse my early 2000s slang)
Thanks for that clarification.
What I still don’t quite understand is, does this mean that there is no “checking against an ASCII table” or “lookup” going on? But how then does it know that 0x18 is CAN (disregard)?
Imagine you want to encipher a text. You can keep a table with one column holding the characters you actually mean and another column holding their obscured representation. What you write down is the obscured form, and when you read it back you take out the table to translate it to the original meaning. Computers do much the same thing when translating bits to letters. But there are many such tables, and which one applies depends on the languages being read and written.
@akunohana The compiler doesn’t need to know that 0x18 is CAN; that knowledge is embedded in whatever decided that the data you’re inspecting is UTF-8 or ASCII.
The content of your original post has been replaced with a link that I can’t open so I can’t go back and confirm where you said the data was coming from. But if the data was some other exotic encoding then 0x18 would mean something else in the context of that data.
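A small sketch of why the encoding matters to the data rather than to the compiler: the program only ever sees byte values. The helper name below is made up for illustration. The letter é is the two bytes 0xC3 0xA9 in UTF-8 but the single byte 0xE9 in ISO-8859-1, while plain ASCII text contains no bytes at or above 0x80 in either.

```c
#include <stddef.h>

/* Hypothetical helper: counts the bytes in s that fall outside the
   7-bit ASCII range (0x00 to 0x7F). It knows nothing about encodings;
   it only compares numbers. */
size_t count_non_ascii_bytes(const char *s)
{
    size_t count = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if ((unsigned char)s[i] >= 0x80) {
            count++;
        }
    }
    return count;
}
```

Passing "caf\xC3\xA9" (café in UTF-8) counts two such bytes, and "caf\xE9" (café in ISO-8859-1) counts one, even though a reader would call both the same word.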
Oh, maybe I messed something up when editing… Here’s what I wrote:
What is “that thing” that decided the encoding?
@akunohana OK that’s a pretty good question then. In that case the encoding is determined by your terminal (or if not terminal then execution environment). Try invoking env (or locale) and looking at LANG and LC_ALL; those should tell you what your terminal accepts as input and passes along to your program.
Sweet! I think this is the answer that I was looking for, although my post is poorly phrased. 😅
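The same information that LANG and LC_ALL carry can also be read from inside a C program. This is a sketch assuming a POSIX system that provides <langinfo.h>; the function name is made up for illustration.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: returns the name of the character encoding of the
   current locale, for example "UTF-8". Assumes a POSIX system. */
const char *current_codeset(void)
{
    /* Adopt the locale described by the environment (LANG, LC_ALL, ...).
       Without this call the program stays in the default "C" locale. */
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```

Printing the returned string with printf("%s\n", current_codeset()) shows what the program itself believes the encoding to be, which should agree with what locale reports in the same shell.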
Does this mean that in theory, there could arise problems with portability? 98-ish percent of all systems use Unicode, but if I were to run my program on an obscure system whose underlying character encoding is not Unicode or some superset of ASCII, I assume it would return other values?
@akunohana Yes! Although I would say that ASCII is a pretty safe assumption, and it’s really anything above the top of ASCII that you need to account for (document it as a requirement for your program, or take steps to ensure the OS uses the right encoding if you are packaging something for distribution).
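One way to lean less on specific code values is to let the C library and character literals describe the characters. This is only a sketch, not the original program: the function name and the REPLACEMENT macro are made up for illustration, and the replacement value 0x18 is kept from the loop in the question.

```c
#include <ctype.h>
#include <stddef.h>

/* Same value as the 0x18 written in the original loop. */
#define REPLACEMENT '\x18'

/* Hypothetical helper: overwrites every decimal digit, space and tab in s
   with REPLACEMENT. isdigit() and the literals ' ' and '\t' adapt to the
   platform's character set, so no ASCII code values are hard-coded for
   the characters being matched. */
void mask_digits_and_blanks(char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        /* isdigit() expects a value representable as unsigned char. */
        unsigned char c = (unsigned char)s[i];
        if (isdigit(c) || c == ' ' || c == '\t') {
            s[i] = REPLACEMENT;
        }
    }
}
```

Iterating forward with a size_t index also sidesteps the n >= 0 condition, which would never become false if n were an unsigned type.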