Analysis of hardware faults on a million PCs

phongn
Rebel Leader
Posts: 18487
Joined: 2002-07-03 11:11pm

Post by phongn »

Microsoft Research wrote: We present the first large-scale analysis of hardware failure rates on a million consumer PCs. We find that many failures are neither transient nor independent. Instead, a large portion of hardware induced failures are recurrent: a machine that crashes from a fault in hardware is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM and disk subsystems. Our analysis spans desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand name, and characteristics such as machine speed and calendar age. Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.
The prevalence of CPU faults was surprising to me, as was the improved reliability of laptops (in hindsight that should've been obvious).
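
A rough sanity check on those two figures, as a minimal Python sketch. It uses only the "1 in 190" and "1 in 3.3" numbers quoted from the abstract; the function and variable names are made up for illustration, not taken from the paper, and it ignores how the paper actually buckets machines.

```python
import math


def recurrence_ratio(one_in_first: float, one_in_repeat: float) -> float:
    """How many times likelier a repeat crash is than a first crash.

    Both arguments are "1 in N" odds, so each probability is 1/N.
    """
    p_first = 1.0 / one_in_first
    p_repeat = 1.0 / one_in_repeat
    return p_repeat / p_first


# Figures quoted in the abstract for CPU subsystem faults
# (machines with at least 30 days of accumulated CPU time).
ONE_IN_FIRST = 190.0   # chance of a first crash: 1 in 190
ONE_IN_REPEAT = 3.3    # chance of a second crash after the first: 1 in 3.3

ratio = recurrence_ratio(ONE_IN_FIRST, ONE_IN_REPEAT)
orders_of_magnitude = math.log10(ratio)

print(f"repeat crash is about {ratio:.1f}x more likely")
print(f"that is about {orders_of_magnitude:.2f} orders of magnitude")
```

For this particular bucket the ratio comes out a bit under 60x, so somewhat short of a full two orders of magnitude; the abstract's "up to two orders of magnitude" presumably covers the worst case across the subsystems studied.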
Brother-Captain Gaius
Emperor's Hand
Posts: 6859
Joined: 2002-10-22 12:00am
Location: \m/

Post by Brother-Captain Gaius »

Huh. I guess I've just been unlucky then; my desktops have generally been Ol' Reliable while every laptop I've ever had significant contact with has been a shoo-in for the Sledgehammer Solution.
phongn
Rebel Leader
Posts: 18487
Joined: 2002-07-03 11:11pm

Post by phongn »

Brother-Captain Gaius wrote: Huh. I guess I've just been unlucky then; my desktops have generally been Ol' Reliable while every laptop I've ever had significant contact with has been a shoo-in for the Sledgehammer Solution.
Microsoft was only able to analyze CPU, RAM and disk errors; there could be many other problems with a laptop that went undetected by their metrics (which they freely admit).
PeZook
Emperor's Hand
Posts: 13237
Joined: 2002-07-18 06:08pm
Location: Poland

Post by PeZook »

Well, if they didn't collect data about GPU and bridge failures, then that explains it, doesn't it? :D
TronPaul
Padawan Learner
Posts: 232
Joined: 2011-12-05 12:12pm

Post by TronPaul »

I haven't read the study in full yet, but it seems to me that Microsoft did not include the more likely failures: power supply, motherboard, or GPU. I can't find their data set size in the paper either.

I bet most of the hardware failures in the study are due to overheating. I'd be more interested in seeing the difference between computers that are adequately cooled and inadequately cooled.
phongn
Rebel Leader
Posts: 18487
Joined: 2002-07-03 11:11pm

Post by phongn »

TronPaul wrote: I haven't read the study in full yet, but it seems to me that Microsoft did not include the more likely failures: power supply, motherboard, or GPU. I can't find their data set size in the paper either.
Their analysis tools cannot report those sorts of faults. §4 says that there were c. 950,000 machines.
TronPaul wrote: I bet most of the hardware failures in the study are due to overheating. I'd be more interested in seeing the difference between computers that are adequately cooled and inadequately cooled.
Alas, Windows doesn't really grab that data :/ Still, the levels of unreliability are much higher than I expected.
Sarevok
The Fearless One
Posts: 10681
Joined: 2002-12-24 07:29am
Location: The Covenants last and final line of defense

Post by Sarevok »

Can someone elaborate on CPU faults? Is it the hardware developing a permanent fault, or an error in the CPU design causing it to execute an operation differently, leading to software crashes?
phongn
Rebel Leader
Posts: 18487
Joined: 2002-07-03 11:11pm

Post by phongn »

Sarevok wrote: Can someone elaborate on CPU faults? Is it the hardware developing a permanent fault, or an error in the CPU design causing it to execute an operation differently, leading to software crashes?
Both. The study counts a CPU fault whenever the processor raises a machine-check exception.