Every so often, I get asked questions along the lines of one of these:
Question type #1:
I need a type that is exactly n bytes/bits long. How do I get this?
Question type #2:
On machine x, the size of some type (usually wchar_t) is wrong! How do I fix it?
I actually just got the second question, hence the reason for this post (although that guy won’t see the answer, because he was a dick). The short answer is this:
(Most) types in C and C++ are either only relatively sized or implementation defined.
First things first – newer versions of C and C++ define types with specific sizes: C99 added them in <stdint.h>, and C++11 in <cstdint>. These are named for how many bits they take up (int8_t, uint32_t, and so on), and each one is a typedef for one of the implementation-defined types (the compiler writers do the magic to figure out which). Not every type is covered, though: you get the integer types, and C11/C++11 add the fixed-width character types char16_t and char32_t, but floating-point types have no equivalent in those standards.
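Here’s a minimal sketch of what using the sized types looks like, assuming a C11 compiler (for static_assert). The names sample_t, port_t, and sample_bits are made up for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* int32_t, uint16_t */

/* Hypothetical names. The exact-width types only exist if the platform
 * has a native type of that width, which is true on mainstream machines. */
typedef int32_t  sample_t;  /* exactly 32 bits, two's complement */
typedef uint16_t port_t;    /* exactly 16 bits */

/* Fail the build, rather than misbehave at run time, if the widths are off. */
static_assert(sizeof(sample_t) * CHAR_BIT == 32, "sample_t must be 32 bits");
static_assert(sizeof(port_t) * CHAR_BIT == 16, "port_t must be 16 bits");

/* Returns the width of sample_t in bits. */
static unsigned sample_bits(void)
{
    return (unsigned)(sizeof(sample_t) * CHAR_BIT);
}
```

If the exact width doesn’t matter and you only need "at least this many bits", int_least32_t and friends from the same header are the more portable choice.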
The reason most types are only relatively sized is pretty simple if you look at how they are defined (I’m paraphrasing from memory, so forgive the lack of detail): each is generally defined to be the most natural representation for its kind on the machine it is compiled for. Other than that, you only get minimum ranges (int is at least 16 bits, long at least 32) and guarantees of relative size, like:
char <= short <= int <= long <= long long
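Also, int has traditionally been the natural word size of the machine, but don’t count on it matching the pointer size: on 64-bit Linux, int is 4 bytes while pointers are 8. If you need an integer that can hold a pointer, that is what intptr_t and uintptr_t are for.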
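If you want to see what your own machine does, something like the following sketch works, assuming a C11 compiler. The function names are mine, and the static checks encode the ordering as it holds in practice on real platforms.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>   /* intptr_t */
#include <stdio.h>

/* The relative ordering, checked at compile time. */
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(short) <= sizeof(int), "short must not exceed int");
static_assert(sizeof(int) <= sizeof(long), "int must not exceed long");
static_assert(sizeof(long) <= sizeof(long long), "long must not exceed long long");

/* Prints the sizes (in bytes) that this particular implementation picked. */
static void print_sizes(void)
{
    printf("short=%zu int=%zu long=%zu long long=%zu void*=%zu\n",
           sizeof(short), sizeof(int), sizeof(long),
           sizeof(long long), sizeof(void *));
}

/* Returns nonzero if a plain int is as wide as a pointer on this platform. */
static int int_holds_pointer(void)
{
    return sizeof(int) >= sizeof(void *);
}
```

Call print_sizes() from your own main() on a couple of different machines and the numbers will differ, which is the whole point.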
Anyways, the first type of question is absolutely valid – what if you really do want something of a specific size? If you are using a version of C/C++ that supports the sized types, then you are in luck. If you are using a compiler that has some extension to do it, then you are in a bit of luck (the kind of luck that isn’t standards compliant and likely not compatible with other compilers). Or, you could just pony up and use things like Boost, which contains standard and static ways of defining types to have certain requirements (like “must be at least this big”), or autoconf, which you can use to generate configure scripts that tell you what types are what size (so you can write macros or typedefs defining types to be what you need).
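To give a rough idea of the "write macros or typedefs" route without the sized types, here’s a hand-rolled sketch that picks a 32-bit unsigned type from the limits in <limits.h>. This is not what autoconf generates, just the same idea done by hand; my_u32 and wrap_add are hypothetical names, and static_assert assumes C11.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* UINT_MAX, ULONG_MAX, USHRT_MAX */

/* Pick the first unsigned type whose maximum is exactly 2^32 - 1. */
#if UINT_MAX == 0xFFFFFFFFu
typedef unsigned int my_u32;
#elif ULONG_MAX == 0xFFFFFFFFul
typedef unsigned long my_u32;
#elif USHRT_MAX == 0xFFFFFFFFu
typedef unsigned short my_u32;
#else
#error "no 32-bit unsigned type on this platform"
#endif

/* Double-check the choice at compile time. */
static_assert((my_u32)-1 == 0xFFFFFFFFu, "my_u32 must be exactly 32 bits");

/* Adds two values; the result wraps modulo 2^32 because of the type's width. */
static my_u32 wrap_add(my_u32 a, my_u32 b)
{
    return (my_u32)(a + b);
}
```

The nice part is the #error branch: on a machine with no suitable type, the build stops instead of quietly giving you the wrong size.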
The second question is the one that kinda makes me angry, especially stated as it was (two key words were “fix”, as in “how do I fix this bug?”, and “proper”, as in “to get proper behavior”). I’m sure you all know just how forgiving I am of people who mistake standard-defined behavior for a bug. Generally, these are the same people who think that Visual Studio is the standard. I suppose we can’t all have the level of IQ required to, well, feed ourselves, but these people really deserve our pity, not our anger.
Anywhoo, the guy’s question was related to the size of wchar_t. Although he didn’t specify compiler, library, or even language (like asking a question along the lines of, “I drive a car. Do you have brakes for it?”), I assume he meant gcc, glibc, and C. The “bug” was that, on Fedora Core 6 (as he said – if he is using glibc, it applies to all gcc/glibc), the size of wchar_t was incorrectly set to the size of an int.
Now, let’s be clear. The standard (C standard) says that the implementation gets to pick the size, and the only requirement is that it has to be big enough to hold every member of the largest character set among the supported locales. glibc assumes that you might be using UCS-4 or UTF-32, and uses an int. It also provides a type called wint_t, which is an integer type that can hold any wchar_t value plus the special WEOF value.
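Rather than guessing, you can just ask the implementation what it chose. A small sketch, assuming C11; the two function names are made up.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdio.h>
#include <wchar.h>    /* wchar_t, wint_t, WCHAR_MAX */

/* The only hard floor the standard puts on wchar_t's range. */
static_assert(WCHAR_MAX >= 127, "WCHAR_MAX must be at least 127");

/* Prints what this implementation picked for the wide character types. */
static void print_wide_sizes(void)
{
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    printf("sizeof(wint_t)  = %zu\n", sizeof(wint_t));
    printf("WCHAR_MAX       = %llu\n", (unsigned long long)WCHAR_MAX);
}

/* Returns nonzero if one wchar_t can hold any Unicode code point
 * (the highest one is U+10FFFF). */
static int wchar_holds_any_code_point(void)
{
    return (unsigned long long)WCHAR_MAX >= 0x10FFFFull;
}
```

On a gcc/glibc system I’d expect wchar_holds_any_code_point() to return nonzero, while an implementation with a 2-byte wchar_t would return zero. Neither answer is a bug.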
Anyways, the question had the form of two larger questions. The first is the obvious and annoying “I’m going to assume this is a bug because I have no idea what-the-hell I’m talking about”. Here’s another car analogy to the second:
“I need to replace my turn-signal arm thingy on my car. Do you have a replacement?”
“Uh, sure, but why do you need it replaced?”
“Oh, well, my front left turn signal won’t come on!”
The general form being, of course, diagnosing the problem entirely incorrectly and asking for a solution to that problem. Raymond Chen usually writes about this explicitly a few times a year and uses it as a constant theme throughout his blog. Most of the time, people are asking you to solve a problem that is rather unrelated to the original issue.
In this case, the issue isn’t that wchar_t is the “wrong” size.
If he just wanted a specific encoding, he could set the locale (via setlocale()) and accept the wasted space in the case of UTF-16.
Maybe he really wants to save those extra 2 bytes per character. He probably isn’t working in an embedded environment (this is on Fedora Core 6), but he just really wants to save the bytes. In that case, if he needs it so badly, he should be defining his own character type, or using gcc’s -fshort-wchar and rewriting glibc functions (locally, of course) that won’t work with a smaller wchar_t size.
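The "define your own character type" option is less work than it sounds. Here’s a minimal sketch, assuming the data really is UTF-16; utf16_unit, utf16_encode, and utf16_len are names I invented, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-byte character type, independent of wchar_t. */
typedef uint16_t utf16_unit;

/* Encodes one code point as UTF-16.
 * Returns the number of units written (1 or 2), or 0 if cp is not encodable. */
static size_t utf16_encode(uint32_t cp, utf16_unit out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                        /* out of range, or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (utf16_unit)cp;         /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                       /* 20 bits left, split 10/10 */
    out[0] = (utf16_unit)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (utf16_unit)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Length in code units of a zero-terminated string (a wcslen stand-in). */
static size_t utf16_len(const utf16_unit *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

Note that utf16_len counts code units, not characters: anything outside the Basic Multilingual Plane takes two units, which is exactly the detail that people who think “Unicode” means “2 bytes” tend to miss.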
Finally, he might need a smaller size because he is sending data to another program that assumed wchar_t was 2 bytes, which has an obvious answer – the “bug” was that he assumed wchar_t was 2 bytes in the first place, and thus screwed himself over. This is likely the case, as most people at the company I work at assume “Unicode” is just a fancy word for UTF-16.
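If the data is going to another program, the safer move is to write it out in an explicit format instead of dumping whatever wchar_t happens to look like in memory. A sketch, assuming the other side expects little-endian UTF-16 (that byte order is my assumption, and put_utf16le is a made-up name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes n 16-bit code units to out as little-endian byte pairs.
 * out must have room for 2 * n bytes. Returns the number of bytes written. */
static size_t put_utf16le(const uint16_t *units, size_t n, unsigned char *out)
{
    assert(units != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)(units[i] & 0xFF);  /* low byte first */
        out[2 * i + 1] = (unsigned char)(units[i] >> 8);    /* then high byte */
    }
    return 2 * n;
}
```

Written this way, the output is the same no matter what size wchar_t is or what the host’s byte order is, so neither program has to assume anything about the other.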
The moral of the story is this – when you ask questions, always be humble. You will never, ever find a bug in gcc. There are bugs, I’m sure, but the likelihood of you finding one without looking for one is minuscule (I distinctly remember cases where students complained that they found bugs in gcc, oftentimes in cases where they did something obviously wrong and rather stupid).
Always assume that you fucked up, and always explain the original problem and give context. Here’s how the guy’s email should have read:
(polite introduction)
I’m using Fedora Core 6, gcc 4.n, glibc n.n, and writing in C. I really want to store my wide characters in 2 bytes, but I find (by looking through documentation, as I most certainly did at least a perfunctory google search before I emailed this) that glibc sets the size of wchar_t to be able to encode UTF-32/UCS-4. I’m using UTF-16, and I know that I could just lose 2 bytes per character, but [something really important] means I have to have it in 2 bytes.
I see that gcc has an option for this, -fshort-wchar, but that glibc won’t play nicely with it.
Does anybody have any advice for how to do this?
Since the result of this will likely mean that I have to use a non-glibc version of at least some of the wide character functions, does anyone know of a replacement Unicode library with all of the standard wide character functions that work with UTF-16?
If this isn’t the way to go about this, can you tell me what direction I should be headed?
Thanks very much!
[nice-person who does his homework, isn't a douche, and provides necessary information]
So there you have it. Here are my tenets of asking questions:
- Always assume you made a mistake, unless it is painfully obvious that you didn’t (like some command-line tool crashed and threw an exception instead of handling an error condition – happened to me a few weeks ago)
- Always do your homework and try to figure things out
- Always provide information about your system/toolchain/whatever; even code might help.
- Always explain the situation, and ask general questions when those are the real problems (e.g. “I have a 2007 Mazda 6i, and the front-left turn signal won’t turn on (though the front-right turn signal works fine, as do both of the back turn signals). What should I do?”)
- Say “thank you”. You are wasting people’s time, so be appropriately sorry for doing so. On the other hand, don’t say “Thanks in advance” – this pisses some people off, as it presumes that they will be helping you. Saying “Thank you” just means “thanks for taking the time to read this”.