This post contains a few of my thoughts on common coding mistakes we see during code reviews when developers deal with wide character arrays. Manipulating wide character strings is reasonably easy to get right, but plenty of "gotchas" still pop up: a few things can slip your mind when dealing with these strings and result in mistakes.
A little bit of background:
The term wide character generally refers to character data types with a width larger than a byte (the width of a normal char). The actual size of a wide character varies between implementations, but the most common sizes are 2 bytes (i.e. Windows) and 4 bytes (i.e. Unix-like OSes). Wide characters usually represent a particular character using one of the Unicode character sets: in Windows this will be UTF-16 and for Unix-like systems, whose wide characters are twice the size, this will usually be UTF-32.
Windows seems to love wide character strings and has made them standard. As a result, many Windows APIs come in two versions: functionNameA and functionNameW, an ANSI version and a wide char string version, respectively. If you've done any development on Windows systems, you'll be no stranger to wide character strings.
There are definite advantages to representing strings as wide char arrays, but there are a lot of mistakes to make, especially if you're used to developing on Unix-like systems or you forget to consider the fact that one character does not equal one byte.
For example, consider the following scenario, where a Windows developer unsuspectingly begins to parse a packet that follows their proprietary network protocol. The code shown takes a UTF-16 string length (unsigned int) from the packet and performs a bounds check. If the check passes, a string of the specified length (assumed to be a UTF-16 string) is copied from the packet buffer to a fresh wide char array on the heap.
[ … ]
if(packet->dataLen > 34 || packet->dataLen < sizeof(wchar_t)) bailout_and_exit();
size_t bufLen = packet->dataLen / sizeof(wchar_t);
wchar_t *appData = new wchar_t[bufLen];
memcpy(appData, packet->payload, packet->dataLen);
[ … ]
This might look okay at first glance; after all, we're just copying a chunk of data to a new wide char array. But consider what would happen if packet->dataLen was an odd number. For example, if packet->dataLen = 11, we end up with size_t bufLen = 11 / 2 = 5, since the remainder of the division is discarded. So, a five-element wide character buffer is allocated, into which the memcpy() copies 11 bytes. Since five wide chars on Windows is 10 bytes (and 11 bytes are copied), we have an off-by-one overflow. To avoid this, the modulo operator should be used to check that packet->dataLen is even to begin with; that is:
if(packet->dataLen % 2) bailout();
Another common mistake is forgetting that the NULL terminator on the end of a wide character buffer is not a single NULL byte: it's two NULL bytes (or four, on a UNIX-like box). This can lead to problems when the usual len + 1 is used instead of the len + 2 required to account for the extra NULL byte(s) needed to terminate wide char arrays; for example:
int alloc_len = len + 1;
wchar_t *buf = (wchar_t *)malloc(alloc_len);
memset(buf, 0x00, alloc_len);
wcsncpy(buf, srcBuf, len);
If srcBuf had len wide chars in it, all of these would be copied into buf, but wcsncpy() would not NULL terminate buf. With normal character arrays, the added byte (which will be a NULL because of the memset) would be the NULL terminator and everything would be fine. But since wide char strings need either a two- or four-byte NULL terminator (Windows and UNIX, respectively), we now have a non-terminated string that could cause problems later on.
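A fixed version of the snippet above, as a sketch: size the allocation in bytes, but reserve a full wchar_t for the terminator and write it explicitly.

```cpp
#include <cstdlib>
#include <cstring>
#include <cwchar>

// len is the number of wide characters in srcBuf. The terminator
// needs a whole wchar_t (2 or 4 bytes), not a single byte.
wchar_t *copy_wide(const wchar_t *srcBuf, size_t len) {
    size_t alloc_len = (len + 1) * sizeof(wchar_t);  // bytes, incl. terminator
    wchar_t *buf = (wchar_t *)malloc(alloc_len);
    if (buf == NULL)
        return NULL;
    memset(buf, 0x00, alloc_len);
    wcsncpy(buf, srcBuf, len);
    buf[len] = L'\0';   // explicit wide NULL terminator
    return buf;
}
```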
Some developers also slip up when they interchange the number of bytes and the number of characters. That is, they use the number of bytes as a copy length when the function was asking for the number of characters to copy; something like the following is pretty common:
int destLen = (stringLen * sizeof(wchar_t)) + sizeof(wchar_t);
wchar_t *destBuf = (wchar_t *)malloc(destLen);
MultiByteToWideChar(CP_UTF8, 0, srcBuf, stringLen, destBuf, destLen);
[ do something ]
The MultiByteToWideChar API is documented at <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx>.
The problem with the sample shown above is that the sixth parameter to MultiByteToWideChar is the length of the destination buffer in wide characters, not in bytes, as the call above passes it. Our destination length is off by a factor of two here (or four on UNIX-like systems, generally) and ultimately we can end up overrunning the buffer. These sorts of mistakes result in overflows, and they're surprisingly common.
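MultiByteToWideChar is Windows-specific, but the standard C mbstowcs() follows the same convention: its size parameter counts wide characters, not bytes. A portable sketch of the corrected pattern (the helper name is mine):

```cpp
#include <cstdlib>
#include <cwchar>

// Convert a NUL-terminated multi-byte string of stringLen bytes into
// a freshly allocated wide character buffer. The size argument to
// mbstowcs() is in wide characters, not bytes.
wchar_t *to_wide(const char *srcBuf, size_t stringLen) {
    size_t destChars = stringLen + 1;   // capacity in wide chars, incl. terminator
    wchar_t *destBuf = (wchar_t *)malloc(destChars * sizeof(wchar_t));
    if (destBuf == NULL)
        return NULL;
    // Pass the capacity in wide characters; mbstowcs() NUL-terminates
    // the output if it fits.
    size_t n = mbstowcs(destBuf, srcBuf, destChars);
    if (n == (size_t)-1) {              // invalid multi-byte sequence
        free(destBuf);
        return NULL;
    }
    return destBuf;
}
```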
The same sort of mistake can also be made when using "safe" wide char string functions like wcsncpy(); for example:
unsigned int destLen = (stringLen * sizeof(wchar_t)) + sizeof(wchar_t);
memset(destBuf, 0x00, destLen);
wcsncpy(destBuf, srcBuf, sizeof(destBuf));
While using sizeof(destBuf) for the maximum destination size would be fine if we were dealing with normal characters, this doesn't work for wide character buffers. Instead, sizeof(destBuf) will return the number of bytes in destBuf (assuming destBuf is an array), which means the wcsncpy() call above can end up copying twice as many bytes to destBuf as intended; again, an overflow.
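Assuming destBuf really is a fixed-size array, the fix is to pass wcsncpy() the element count rather than the byte count. A minimal sketch (the 64-element size is an assumption):

```cpp
#include <cwchar>

// Compute the element count, not the byte count, when calling
// wcsncpy() on a fixed-size wide character array.
void safe_copy(wchar_t (&destBuf)[64], const wchar_t *srcBuf) {
    size_t destChars = sizeof(destBuf) / sizeof(destBuf[0]);  // elements, not bytes
    wcsncpy(destBuf, srcBuf, destChars - 1);   // leave room for the terminator
    destBuf[destChars - 1] = L'\0';            // guarantee termination
}
```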
The other wide char equivalent string manipulation functions are also prone to misuse in the same ways as their normal char counterparts; look for all the wide char equivalents when auditing such functions as wcsncpy(), etc. There are also a few wide char-specific APIs that are easily misused; take, for example, wcstombs(), which converts a wide char string to a multi-byte string. The prototype looks like this:
size_t wcstombs(char *restrict s, const wchar_t *restrict pwcs, size_t n);
It does bounds checking, so the conversion stops when n bytes have been written to s or when a NULL terminator is encountered in pwcs (the source buffer). If an error occurs, i.e. a wide character in pwcs can't be converted, the conversion stops and the function returns (size_t)-1; otherwise, the number of bytes written is returned. MSDN considers wcstombs() to be deprecated, but there are still a few common ways to mess up when using it, and they all revolve around not checking return values.
If a bad wide character is encountered in the conversion and you're not expecting a negative number to be returned, you could end up under-indexing your array; for example:
i = wcstombs( … ); // wcstombs() can return (size_t)-1
buf[i] = L'\0';
If a bad wide character is found during conversion, the destination buffer will not be NULL terminated and may contain uninitialized data if you didn't zero it or otherwise initialize it beforehand.
Additionally, if the return value is n, the destination buffer won't be NULL terminated, so any string operations later carried out on or using the destination buffer could run past the end of the buffer. Two possible consequences are a potential page fault, if an operation runs off the end of a page, or potential memory corruption bugs, depending on how destBuf is used later. Developers should prefer wcstombs_s() or another, safer alternative. Bottom line: always read the docs before using a new function, since APIs don't always do what you'd expect (or want) them to do.
Another thing to watch out for is accidentally interchanging wide char and normal char functions. A good example would be incorrectly using strlen() on a wide character string instead of wcslen(). Since wide char strings are chock full of NULL bytes, strlen() isn't going to return the length you were after. It's easy to see how this can end up causing security problems if a memory allocation is done based on a strlen() that was incorrectly performed on a wide char array.
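A quick illustration of the mismatch (the helper names are mine): strlen() counts bytes up to the first zero byte, and a wide string full of embedded zero bytes stops it almost immediately.

```cpp
#include <cstring>
#include <cwchar>

// The wrong tool: counts bytes and stops at the first 0 byte,
// which for an ASCII-range wide string is inside the first character
// or two.
size_t wrong_length(const wchar_t *ws) {
    return strlen((const char *)ws);
}

// The right tool: counts wide characters up to the wide terminator.
size_t right_length(const wchar_t *ws) {
    return wcslen(ws);
}
```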
Mistakes can also be made when trying to develop cross-platform or portable code; don't hardcode the presumed size of wchar_t. In the examples above, I have assumed sizeof(wchar_t) = 2; however, as I've said a few times, this is NOT necessarily the case at all, since many UNIX-like systems have sizeof(wchar_t) = 4.
Making these assumptions about width could easily result in overflows when they are violated. Let's say someone runs your code on a platform where wide characters aren't two bytes in length, but are four; consider what would happen here:
wchar_t *destBuf = (wchar_t *)malloc(32 * 2 + 2);
wcsncpy(destBuf, srcBuf, 32);
On Windows, this would be fine since there's enough room in destBuf for 32 wide chars plus a NULL terminator (66 bytes). But as soon as you run this on a Linux box, where wide chars are four bytes, you're going to need 4 * 32 + 4 = 132 bytes, resulting in a pretty obvious overflow.
So don't make assumptions about how large wide characters are supposed to be, since it can and does vary. Always use sizeof(wchar_t) to find out.
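Here's the earlier snippet rewritten portably as a sketch: everything is sized in terms of sizeof(wchar_t), so it is correct whether wide characters are two or four bytes.

```cpp
#include <cstdlib>
#include <cwchar>

// Copy up to 32 wide characters into a correctly sized heap buffer,
// letting the compiler supply the wide character width.
wchar_t *copy32(const wchar_t *srcBuf) {
    wchar_t *destBuf = (wchar_t *)malloc((32 + 1) * sizeof(wchar_t));
    if (destBuf == NULL)
        return NULL;
    wcsncpy(destBuf, srcBuf, 32);   // at most 32 wide characters
    destBuf[32] = L'\0';            // terminate explicitly
    return destBuf;
}
```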
When you're reviewing code, keep your eye out for the use of unsafe wide char functions, and ensure the math is right when allocating and copying memory. Make sure you check your return values properly and, most obviously, read the docs to make absolutely sure you're not making any mistakes or missing something important.