The actual report from Google.
Hard disk test 'surprises' Google
The impact of heavy use and high temperatures on hard disk drive failure may be overstated, says a report by three Google engineers.
The report examined 100,000 commercial hard drives, ranging from 80GB to 400GB in capacity, used at Google since 2001.
The firm uses "off-the-shelf" drives to store cached web pages and services.
"Our data indicate a much weaker correlation between utilisation levels and failures than previous work has suggested," the authors noted.
A wide variety of manufacturers and models were included in the report, but a breakdown was not provided.
Widely-held belief
There is a widely held belief that hard disks which are subject to heavy use are more likely to fail than those used intermittently. It was also thought that hard drives preferred cool temperatures to hotter environments.
The authors wrote: "We expected to notice a very strong and consistent correlation between high utilisation and higher failure rates.
"However our results appear to paint a more complex picture. First, only very young and very old age groups appear to show the expected behaviour."
A hard disk was described as having "failed" if it needed to be replaced.
The report was compiled by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso, and was presented to a storage conference in California last week.
In the report the authors said Google had developed an infrastructure which collected "vital information" about all of the firm's systems every few minutes.
'Essentially forever'
The firm then stores that information "essentially forever".
Google employs its own file system to organise the storage of data, using inexpensive commercially available hard drives rather than bespoke systems.
Hard drives less than three years old and used a lot are less likely to fail than similarly aged hard drives that are used infrequently, according to the report.
"One possible explanation for this behaviour is the survival of the fittest theory," said the authors, speculating that drives which failed early on in their lifetime had been removed from the overall sample leaving only the older, more robust units.
The report said that there was a clear trend showing "that lower temperatures are associated with higher failure rates".
"Only at very high temperatures is there a slight reversal of this trend."
But hard drives which are three years old and older were more likely to suffer a failure when used in warmer environments.
"This is a surprising result, which could indicate that data centre or server designers have more freedom than previously thought when setting operating temperatures for equipment containing disk drives," said the authors.
The report also looked at the impact of scan errors - problems found on the surface of a disk - on hard drive failure.
"We find that the group of drives with scan errors are 10 times more likely to fail than the group with no errors," said the authors.
They added: "After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors."
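A figure such as "10 times more likely to fail" is a ratio of two failure rates. The sketch below shows the arithmetic only; the function names and the drive counts are hypothetical and are not taken from the report.

```python
def failure_rate(failed: int, total: int) -> float:
    """Fraction of drives in a group that failed."""
    if total <= 0:
        raise ValueError("group must contain at least one drive")
    return failed / total


def relative_risk(failed_a: int, total_a: int, failed_b: int, total_b: int) -> float:
    """How many times more likely a drive in group A is to fail than one in group B."""
    rate_b = failure_rate(failed_b, total_b)
    if rate_b == 0:
        raise ValueError("baseline group has no failures; ratio is undefined")
    return failure_rate(failed_a, total_a) / rate_b


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
with_scan_errors = (30, 1_000)        # 3% of this group failed
without_scan_errors = (300, 100_000)  # 0.3% of this group failed

ratio = relative_risk(*with_scan_errors, *without_scan_errors)
print(f"Drives with scan errors fail about {ratio:.0f}x as often")
```

Note that a large ratio can coexist with a small absolute failure rate, so both numbers are worth reporting.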
This was posted on SH/SC; someone sent it around their company and one of their 'storage engineers' responded as follows:
I’ve just finished skimming this and while I want to read it more closely, please be careful what you do with this information. As a storage engineer, I noticed very quickly a number of factors that were missed in their analysis that are major components of analytical failure prediction models used by the industry – factors which I have found in my experience to be highly relevant.
Drive Age – Absolutely true. It is well known that 70%+ of drive failures occur in the first six months. Most of these are electronic failures, by the way, relating to chip failure or poor solder connections.
Disk Activity/Utilization: Their analysis covers only the range of activity they observe in their own environment, not the range of activity possible. This is a dramatic weakness of the study. For example, if their actual average utilization on the high end is only around a 30% busy rate (and despite what they said, it is common practice to use drive busy rate statistics here), their top quartile will barely exceed the expectations for consumer class devices. Further, they did not measure read/write rate correlations, which are extremely important. Where drives are run at or near a 100% busy rate with balanced read/write ratios (near 50%), consumer class devices do fail much more often than enterprise class devices, and these failures are almost always mechanical or platter related. These kinds of tests are done in large scale benchmarking environments by all major drive and OEM vendors. Given Google's data classification style, the majority of their data has to be sequential write/random read with a very high read ratio, probably exceeding 98% overall for both data and index categories. Their paper neither reports this factor nor takes it into account.
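The point about observed versus possible range can be illustrated with a small sketch. It assumes "high utilisation" means the top quarter of the fleet by busy rate; the fleet values and the function name are hypothetical, not data from the paper.

```python
from statistics import quantiles


def top_quartile_threshold(busy_rates):
    """Busy rate (in percent) above which a drive lands in the top 25% of the fleet."""
    _q1, _q2, q3 = quantiles(busy_rates, n=4)
    return q3


# Hypothetical fleet whose busiest drive is only 30% busy.
fleet = [2, 4, 5, 7, 8, 10, 12, 15, 18, 20, 25, 30]

threshold = top_quartile_threshold(fleet)
print(f"'High utilisation' in this fleet starts at {threshold:.1f}% busy")
print(f"Headroom to a fully busy drive: {100 - threshold:.1f} percentage points")
```

Under these assumed numbers, a drive labelled "high utilisation" relative to its peers is still far from a 100% busy rate, so a percentile-based result says little about drives run near saturation.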
Temperature: They start this off nicely by noting that temperature deltas of 15 degrees Celsius are often quoted as issues, then they immediately move into periodic average thermal analysis. What in fact the industry knows is that temperature has little effect on the running of the drive as long as it is reasonably stable. There are two critical factors. First, if the drive gets too hot, you can have failure of the electronics. Duh. The paper acknowledges this but does not factor it significantly. Secondly, it is the temperature differential over a short period of time that is important, not the overall temperature differential, or the actual temperature at which the drive operates. This is simple physics. As the platter heats up it expands slightly, which affects the ability of the heads to write and read tracks accurately. The drives are designed to realign their track vectors periodically (and when certain types of I/O errors occur). However, they are vectors which assume that the platter geometry is predictable throughout. If a drive is being heated rapidly, the inner and outer tracks will have a significant thermal difference, resulting in unpredictable geometry. This will induce a drive failure. Ironically, in most cases the drive is recoverable, but it will have to be thermally stabilized then low-level formatted to correct geometry and inter-track sequences that may have been overwritten by bad geometry calculations. They do note that drives run at extremely low temperatures show additional failure rates that they did not expect. This is in fact related to the same issue; if the inflow air is extremely cold, but the drive is generating sufficient heat, this may cause a large thermal differential between the front of the drive enclosure and the rear of the drive enclosure. This will induce a slight shift in the alignment of the spindle bearings, which will in turn induce platter wobble and a higher rate of drive failure, especially with higher RPM drives.
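The difference between an average temperature and a short-term temperature swing can be made concrete with a sketch. The function name, the 120-second window and both traces are hypothetical; no threshold for a harmful swing is implied.

```python
from statistics import mean


def max_short_term_delta(samples, window_s):
    """Largest temperature swing (max minus min) seen within any span of window_s seconds.

    samples: list of (timestamp_seconds, temp_celsius) tuples, sorted by timestamp.
    """
    worst = 0.0
    start = 0
    for end in range(len(samples)):
        # Slide the window start forward until the span fits within window_s.
        while samples[end][0] - samples[start][0] > window_s:
            start += 1
        temps = [temp for _, temp in samples[start:end + 1]]
        worst = max(worst, max(temps) - min(temps))
    return worst


# Two hypothetical traces with the same average temperature.
steady = [(0, 38.0), (60, 39.0), (120, 40.0), (180, 41.0), (240, 42.0)]
spiky = [(0, 40.0), (60, 30.0), (120, 50.0), (180, 30.0), (240, 50.0)]

for name, trace in (("steady", steady), ("spiky", spiky)):
    avg = mean(temp for _, temp in trace)
    swing = max_short_term_delta(trace, window_s=120)
    print(f"{name}: average {avg:.1f} C, worst 120 s swing {swing:.1f} C")
```

Both traces average the same temperature, so an analysis based on periodic averages would treat them identically even though their short-term swings differ widely.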
SMART: SMART by definition assumes that the primary failure we are watching for is platter integrity failure. I did not see anything in the analysis that acknowledged this as a primary issue.
As to the methodology ... it is a great method for a company to use to analyze its own failure rates, but it is far too dependent on the individual customer profile to support any broad conclusions about how well the findings would hold for anyone else.
Overall an interesting read.