
Intel to finally scatter remaining ashes of Itanium to the wind in 2021: Final call for doomed server CPU line

Michael Wojcik Silver badge

I won't miss debugging UB on the Itanic

Itanium was responsible for one of those really horrible Heisenbug investigations, back in the day.

Customer reports that once in a while, a server application closes the conversation without returning a response. The log shows the server caught a SIGILL (Illegal Instruction). This only happens on HP-UX on an Itanium box, and only very rarely. But they've caught a few instances of it.

With significant effort we set up a reasonably close environment and try to reproduce. Can't get anything to happen manually, of course, so I set up an automated test with the debugger attached to the server and let it run. SIGILL in C code is most often caused by vectoring through a function pointer with a bogus value, so I spend my time poring over the source, looking at all the function pointers that might be involved in this code path and their data flows. Can't find anything.

Finally the debugger pops, with a SIGILL. Aha! Except... the instruction in question is valid. Its operands appear to be valid. The HP-UX debugger's support for low-level debugging is ... not great, and my knowledge of the Itanium ISA is pretty much whatever I'm digging out of Google, and - shockingly - reading VLIW disassembly is a pain in the ass. But I'm not seeing the problem. And almost all the time we make it down this exact same code path with no problems.

I ask on comp.os.hpux to see if anyone has any ideas. No one does. The problem lingers.

Then, one day, one of our devs who knows HP-UX and Itanium particularly well sends around an email warning about a subtle potential issue with Itanium. The Itanium supports a trap representation for its integer registers - what's known as the NAT (Not A Thing) value. You can initialize a register to NAT, and if you try to use or store the value in that register, the CPU raises a trap.

Oh. And ho. Let's take a quick look, shall we? Why, yes: when the HP-UX kernel sees that trap, it raises SIGILL for the offending process. It's called "Illegal Instruction", but someone at HP decided it should also be used for "Illegal Value". And didn't, say, update the signal(2) man page to reflect that.
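For anyone chasing something similar: a handler installed with SA_SIGINFO can at least record the si_code that came with the SIGILL, which may help tell a genuinely bad opcode from some other cause. What follows is a minimal sketch using only standard POSIX calls. The names install_sigill_logger, describe_ill_code, last_signo and last_code are made up for illustration, and which si_code HP-UX actually reported for a NAT trap is not something the story above establishes, so treat that part as an open question.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Details of the most recent SIGILL, written by the handler.
 * (Hypothetical names; not from the original server code.) */
static volatile sig_atomic_t last_signo = 0;
static volatile sig_atomic_t last_code = 0;

/* Handler: only plain stores to sig_atomic_t, which is async-signal-safe.
 * NOTE: for a real hardware fault, returning from the handler would
 * re-execute the faulting instruction; production code should log and
 * then terminate (for example with _exit) instead of returning. */
static void on_sigill(int signo, siginfo_t *info, void *context)
{
    (void)context;
    last_signo = signo;
    last_code = info->si_code;
}

/* Install the handler. Returns 0 on success, -1 on failure. */
int install_sigill_logger(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigill;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGILL, &sa, NULL);
}

/* Map a POSIX SIGILL si_code to text, for use outside the handler. */
const char *describe_ill_code(int code)
{
    switch (code) {
    case ILL_ILLOPC: return "illegal opcode";
    case ILL_ILLOPN: return "illegal operand";
    case ILL_ILLADR: return "illegal addressing mode";
    case ILL_ILLTRP: return "illegal trap";
    case ILL_PRVOPC: return "privileged opcode";
    case ILL_PRVREG: return "privileged register";
    case ILL_COPROC: return "coprocessor error";
    case ILL_BADSTK: return "internal stack error";
    default:         return "other (not a standard ILL_* code)";
    }
}
```

Calling install_sigill_logger() at startup and writing describe_ill_code(last_code) to the log on the way out would be one way to use it. A code that is not "illegal opcode" on an instruction the disassembler says is valid would have been a useful hint here.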

OK, so where are these pesky NATs coming from? (NB. Not "pesky gnats", which are usually due to a rotting Apple.) Well, the helpful note from the dev points out that this can happen if the caller of a C function believes the function returns a value, but the function itself actually does not. That's because:

1. There's a dedicated register for returning a value

2. If the function being called is declared as returning a value, the compiler always generates code to read that register on return, even if the caller doesn't use the return value

3. However, a function that is actually defined with a void return type doesn't set that register before returning

4. Thus there is a chance that said register will contain a NAT

Usually it won't, but sometimes it will. Impossible to guess the probability because it depends on what else is running at the time and the phase of the moon and your past misdeeds, etc.

OK x 2. Now, how did we end up with a return-type mismatch between caller and callee?

Turns out some of the code in question antedates standard C - it was actually written before 1989. ANSI/ISO-style function definitions and prototype declarations were added later, but for years we still had to support some platforms with pre-standard C implementations. So a bunch of the older source files had the ISO function definitions #ifdef'd, and the headers had the prototypes #ifdef'd.

And the conditional compilation defaulted to not using them. If a macro ("PROTO" or something like that) was not defined, you got K&R definitions and no prototypes. Yes, this should have been conditional on the standard __STDC__ macro instead, but I wasn't around at the time to language-lawyer it.

And someone had screwed up some Imake template file, so that -DPROTO was missing from some of the makefiles. Consequently, we had the source module with a called function being built with PROTO and correctly declaring the function as return-type void; and we had a source module that called it without a prototype in scope. Which means defaulting to K&R semantics. Which means implicit int return type.
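To make the failure mode concrete, here is a rough reconstruction of the pattern, plus one way it could have been avoided. All the names (PROTO_ARGS, close_conversation, closed_count) are invented for illustration and the real code certainly looked different. The idea is to key the choice on __STDC__ and to keep the declaration, with its return type, visible in both modes, hiding only the parameter list from pre-standard compilers. It is similar in spirit to the __P macro in BSD headers.

```c
/* The problematic pattern, as described (shown as a comment only):
 *
 *   #ifdef PROTO
 *   void close_conversation(int id);   -- vanishes without -DPROTO
 *   #endif
 *
 * A caller compiled without -DPROTO then sees no declaration at all,
 * and a pre-standard compiler assumes `int close_conversation()`.
 */

/* Sketch of an alternative: standard compilers define __STDC__
 * themselves, so a forgotten -DPROTO no longer matters. PROTO is still
 * honoured as a manual override. PROTO_ARGS is a hypothetical macro. */
#if defined(__STDC__) || defined(PROTO)
#define PROTO_ARGS(args) args
#else
#define PROTO_ARGS(args) ()
#endif

/* The declaration is visible in both modes, so the caller always knows
 * the return type is void; only the parameter list is conditional. */
void close_conversation PROTO_ARGS((int id));

/* Counter used only so the sketch has an observable effect. */
static int closed_count = 0;

#if defined(__STDC__) || defined(PROTO)
void close_conversation(int id)
#else
void close_conversation(id)
    int id;
#endif
{
    (void)id;
    closed_count++;
}
```

The point of the design is that the return type never depended on prototypes in the first place. Even a K&R-era compiler accepts a declaration with an empty parameter list, so there was no need for the caller to ever fall back to implicit int.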

I thought Itanic's NAT was a Good Thing. I like trap representations; they can be very useful. But HP-UX's handling of it was an obscure nightmare, and one that was all too easy to fall into.
