No FUD, ex-employee chooses to speak
First, let me tell you why I post anonymously. Too many people in the hardware areas know me, and some of them will be able to deduce who I am just from what I write below. Don't get me wrong, I loved working at SUN, but there are huge problems that layoffs and yet another reorganization aren't going to fix until the people in charge recognize that they are the actual problem.
Let's review, shall we? A bit of history first.
UltraSPARC-V (code name - Millennium) was indeed too little, too late. The 1.0 tapeout mask set had indeed been sent to TI when the cancellation came down from on high and the layoffs began. That gutted the processor hardware engineering department.
Next: let's buy Afara (good move). The Afara people mainly stayed, and out popped Niagara I, Niagara II, and VF, all on time.
Moving to 2008:
UltraSPARC-T2+ (Victoria Falls) - There is still a sneaky little bug with a CSR (control status register), namely the NCX timeout register, but luckily another CSR has a bit that, when turned on, corrects the problem at the cost of a slight performance degradation. (See the earlier Reg article; now you know why.) Had the Zambezi chip (used in the 4-way systems) not been late, this problem might have been caught during pre-silicon verification.
New philosophies began to creep into the picture: verification began to be performed by pure random coverage, and directed functional diagnostics were left to rot, even though a huge number of pre-canned diagnostics had already been written.
UltraSPARC-T3 (KT) - Tapeout was scheduled for early September, and the project was nowhere close to being on track. Yes, 16 cores with 16 threads per core, with provisions to make an 8-way system. Reasons for the delay: the 0-in coverage and assertions weren't written, and 16-core model builds weren't working two months before the supposed tapeout date.
Now, if SUN had a lick of sense, they would have moved the project manager from UltraSPARC-T2+ to UltraSPARC-T3. (UltraSPARC-T2+ did tape out on time.) I won't go into too many details at this time, but the reason that didn't happen was more office politics, clashing egos, and the SUN culture than anything else.
Which only leaves ROCK. Rick Hetherington states "post-silicon analysis". This is normally known as the validation phase, although it's highly questionable whether SUN knows the difference between verification and validation. Post-silicon analysis in this case means looking at actual silicon. Tapeout 1.0 yielded actual silicon, and transactional memory did not work. They even had a problem with a legacy register, the "Y" register. There were a huge number of 1.0 bugs. They had new silicon about the time the July 2008 layoffs began.
So let's do a bit of post-mortem on ROCK. First, ROCK underwent pre-silicon verification using only random test generation. This would be OK as long as your random code generators are top-notch. SUN's simply aren't, and many of them are still being tweaked. Only a few people really know how to use them to their maximum potential, and the documentation on how to use them is horrible. Generally speaking, SUN's technical documentation for its processors is horrible. Anyone who has actually sat down and read a SUN Programmer's Reference Manual knows this.
My post-mortem fixes:
Hire a large number of technical writers. All employees must document and formalize their procedures fully, so that anyone else can do their job. Demonstrate the correctness of that documentation through independent verification.
Switching to fully randomized testing has already proven itself to be a failure. Hire people who can actually write SPARC assembly code. Create a huge bank of directed code diagnostics to test the chips on the actual chip tester. Random testing is great for catching a lot of bugs in the pre-silicon verification phase, but it is horrible to take these same diags and use them on chip testers to screen actual parts. This is where well-thought-out directed diags can make a world of difference in terms of coverage.
Organize a team that is truly in charge of the tester diagnostics. Switch to industry-standard verification procedures. Send the design and verification engineers back to school on the company dime.
Finally, management must change its thinking. Using the principles of Lean and Kaizen wouldn't hurt SUN in the least, starting with open and honest communication.
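For anyone who wants the random-versus-directed point in concrete terms, here is a toy sketch in Python. Everything in it is made up for illustration: the "design" is a hypothetical 32-bit increment with a bug planted at the wrap-around corner, and it has nothing to do with any real SUN tool, diagnostic, or chip. The random budget of 10,000 stimuli is also just an assumption standing in for limited tester time.

```python
import random

MASK32 = 0xFFFFFFFF

def reference_increment(x):
    """Golden model: 32-bit wrapping increment."""
    return (x + 1) & MASK32

def buggy_increment(x):
    """Hypothetical design under test: wrong only at the wrap-around corner."""
    if x == MASK32:
        return MASK32  # planted bug: fails to wrap to 0
    return (x + 1) & MASK32

def run_tests(stimuli):
    """Return the stimuli on which the design disagrees with the reference."""
    return [x for x in stimuli if buggy_increment(x) != reference_increment(x)]

# Random stimulus: uniform over the 32-bit space, fixed seed for repeatability.
rng = random.Random(2008)
random_stimuli = [rng.getrandbits(32) for _ in range(10_000)]

# Directed stimulus: a handful of hand-picked corner cases.
directed_stimuli = [0, 1, 0x7FFFFFFF, 0x80000000, MASK32]

random_failures = run_tests(random_stimuli)
directed_failures = run_tests(directed_stimuli)

print("random tests run:", len(random_stimuli), "failures:", len(random_failures))
print("directed tests run:", len(directed_stimuli), "failures:", len(directed_failures))
```

The five directed stimuli are guaranteed to hit the planted bug because one of them is the corner itself. A uniform random draw hits it with probability 1 in 2^32, so 10,000 draws are expected to hit it roughly 0.000002 times. Random generation earns its keep when you can afford billions of cycles in simulation; on a tester, where every second per part costs money, a small directed set aimed at known corners is likely the better screen.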