
Analysis of hardware faults on a million PCs

Posted: 2012-06-28 10:04am
by phongn
Microsoft Research wrote:We present the first large-scale analysis of hardware failure rates on a million consumer PCs. We find that many failures are neither transient nor independent. Instead, a large portion of hardware induced failures are recurrent: a machine that crashes from a fault in hardware is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM and disk subsystems. Our analysis spans desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand name, and characteristics such as machine speed and calendar age. Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.
The prevalence of CPU faults was surprising to me, as was the improved reliability of laptops (in hindsight that should've been obvious).
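As a rough sanity check on the "up to two orders of magnitude" claim, here is the arithmetic for the CPU example in the abstract. The two probabilities are the ones quoted above; the variable names and the idea of comparing them as a simple ratio are my own, not something taken from the paper.

```python
import math

# Figures quoted in the abstract for CPU subsystem faults
# (machines with at least 30 days of accumulated CPU time).
p_first = 1 / 190    # chance of a first crash from a CPU fault
p_second = 1 / 3.3   # chance of a second crash, given a first one

# How much more likely a repeat crash is than a first crash.
ratio = p_second / p_first
orders = math.log10(ratio)

print(f"repeat crash is about {ratio:.0f}x more likely")
print(f"that is about {orders:.2f} orders of magnitude")
```

So the CPU example alone works out to a bit under two orders of magnitude, which fits the "up to" wording.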

Re: Analysis of hardware faults on a million PCs

Posted: 2012-06-28 10:12am
by Brother-Captain Gaius
Huh. I guess I've just been unlucky then; my desktops have generally been Ol' Reliable while every laptop I've ever had significant contact with has been a shoo-in for the Sledgehammer Solution.

Re: Analysis of hardware faults on a million PCs

Posted: 2012-06-28 10:36am
by phongn
Brother-Captain Gaius wrote:Huh. I guess I've just been unlucky then; my desktops have generally been Ol' Reliable while every laptop I've ever had significant contact with has been a shoo-in for the Sledgehammer Solution.
Microsoft was only able to analyze CPU, RAM and disk errors; there could be many other problems with a laptop that went undetected by their metrics (which they freely admit).

Re: Analysis of hardware faults on a million PCs

Posted: 2012-06-28 10:43am
by PeZook
Well, if they didn't collect data about GPU and bridge failures, then that explains it, doesn't it? :D

Re: Analysis of hardware faults on a million PCs

Posted: 2012-06-28 10:53am
by TronPaul
I haven't read the study in full yet, but it seems to me that Microsoft did not include the more likely failures: power supply, motherboard, or GPU. I can't find their data set size in the paper either.

I bet most of the hardware failures in the study are due to overheating. I'd be more interested in seeing the difference between computers that are adequately cooled and inadequately cooled.

Re: Analysis of hardware faults on a million PCs

Posted: 2012-06-28 11:12am
by phongn
TronPaul wrote:I haven't read the study in full yet, but it seems to me that Microsoft did not include the more likely failures: power supply, motherboard, or GPU. I can't find their data set size in the paper either.
Their analysis tools cannot report those sorts of faults. §4 says that there were c. 950,000 machines.
TronPaul wrote:I bet most of the hardware failures in the study are due to overheating. I'd be more interested in seeing the difference between computers that are adequately cooled and inadequately cooled.
Alas, Windows doesn't really grab that data :/ Still, the levels of unreliability are much higher than I expected.

Re: Analysis of hardware faults on a million PCs

Posted: 2012-06-28 11:13am
by Sarevok
Can someone elaborate on CPU faults? Is it the hardware developing a permanent fault, or an error in the CPU design causing it to execute an operation differently, leading to software crashes?

Re: Analysis of hardware faults on a million PCs

Posted: 2012-06-28 11:25am
by phongn
Sarevok wrote:Can someone elaborate on CPU faults? Is it the hardware developing a permanent fault, or an error in the CPU design causing it to execute an operation differently, leading to software crashes?
Both. The study counts a CPU fault whenever the CPU issues a machine-check exception.