Re: L3 cache ECC errors
Ouch, ouch, and OUCH!
If you are seeing a steady problem while ordering less than a thousand CPUs, then this is a HUGE deal. A 1/1000 escape should stop the line. Seriously, talk to your management about going to the press with this.
BECAUSE you are way, way too small for Intel to care about. And... Intel's parts are everywhere. This isn't going to just affect gamers & miners.
As to what you can do in the meantime, a couple of things come to mind immediately. Sorry if you are already doing these.
1) Double errors happen roughly at the square of the rate of single errors. Screen out parts whose corrected-bit count goes over a certain level; don't wait for the MCE.
2) Check that you are actually running the parts in-spec. I know this sounds insane, but if your workload drives the temperature of the core too high, then you are running it out of spec. Of course, manufacturers do significant work to predict what the highest workload-driven effect can be, but don't trust it. Also, read the specs on the temperature sensors on the part very carefully. Running the core at X temp is not the same as having the sensors report X temp.
3) It sounds like it would be worthwhile to spend some time shortening your burn-in run. (I come from the manufacturer's side of things, so the economics are WAY different than I'm used to, but still...)
a) Part of why I wanted you to check the temp spec so carefully is that if you are sure that you are in spec, you might be able to run at a slightly higher ambient temp & still be in spec. Why would you want to do this? Because the fails will happen faster if you do.
b) Try to identify which part(s) of your workload are triggering the fails, and just run that part over & over. I had test code that could trigger the 750 Medal of Honor bug after 8-10 hours. Eventually, I got it to fail in 1 second.
c) Try to see if there is a commonality to the memory locations that fail. As I mentioned with the Nintendo bug for the 750, it might be possible to target just a handful of cache lines to activate the failure.
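To make 1) and 3c) a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names are made up, the 64-byte cache line size needs checking against your part, and how you collect corrected-error counts and failing addresses (machine-check logs, your own test harness, whatever) is up to you. The squared-rate rule is only the rough approximation from point 1, not a measured model.

```python
from collections import Counter

# Assumption: 64-byte cache lines. Check this against your actual part.
CACHE_LINE_BYTES = 64


def screen_parts(corrected_counts, hours, max_per_hour):
    """Return the IDs of parts whose corrected-error rate exceeds the threshold.

    corrected_counts maps a part ID to the number of corrected (single-bit)
    errors seen during a burn-in run lasting `hours`.
    """
    return sorted(
        part
        for part, count in corrected_counts.items()
        if count / hours > max_per_hour
    )


def relative_double_error_risk(rate, baseline_rate):
    """Rough rule of thumb: double errors scale with the square of single errors.

    Returns how many times more likely a double error is on a part with
    corrected-error rate `rate` than on a part running at `baseline_rate`.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (rate / baseline_rate) ** 2


def hot_cache_lines(fail_addresses, min_hits=2):
    """Group failing addresses by cache line and keep the repeat offenders.

    Returns a dict mapping the cache line base address to its hit count.
    """
    hits_per_line = Counter(addr // CACHE_LINE_BYTES for addr in fail_addresses)
    return {
        line * CACHE_LINE_BYTES: hits
        for line, hits in hits_per_line.items()
        if hits >= min_hits
    }
```

The point of the first two functions is that a part showing ten times the usual corrected-error rate is roughly a hundred times more likely to throw an uncorrectable error, so it is worth pulling it well before the MCE shows up. The third is just a quick way to see whether the fails cluster on a handful of cache lines that you could then hammer directly.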