
When it comes to hardware failures in PCs, servers or storage arrays, I've always operated on the assumption that any component with moving parts is the one most likely to fail, and that the faster it moves, the lower its mean time between failures (MTBF). So it's no revelation that hard drives, spinning at 5400, 7200 or 10000 RPM, are going to fail, which is why RAID and global hot spare drives should always be used to respond automatically to a failure and prevent the loss of data.
Increasing the availability of a storage array usually includes providing other redundant components such as controllers, fans, physical data paths, power supplies, etc., but again the disks were always assumed to be the most prone to failure. Well, I just finished reading an article in The Register titled "You don't know disk about storage failures" that discussed a report compiled by the University of Illinois Department of Computer Science and Network Appliance. The report illustrates that while disk failures account for a large share of storage subsystem failures, other hardware and software components should not be ignored when purchasing a storage array.
The report, titled "Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics", can be viewed in its entirety here, but I'll list its summary of the findings and their implications:
- In addition to disk failures that contribute to 20-55% of storage subsystem failures, other components such as physical interconnects (including shelf enclosures) and protocol stacks also account for significant percentages (27-68% and 5-10%, respectively) of failures. Due to these component failures, even though storage systems of certain types (e.g., low-end primary systems) use more reliable disks than some other types (e.g., near-line backup systems), their storage subsystems exhibit higher failure rates.
These results indicate that, to build highly reliable and available storage systems, only using resiliency mechanisms targeting disk failures (e.g., RAID) is not enough. We also need to build resiliency mechanisms such as redundant physical interconnects and self-checking protocol stacks to tolerate failures in these storage components.
- Each individual storage subsystem failure type and storage subsystem failure as a whole exhibit strong correlations, (i.e. after one failure, the probability of additional failures of the same type is higher). In addition, failures also exhibit bursty patterns in time distribution, (i.e. multiple failures of the same type tend to happen relatively close together).
These results motivate a revisiting of current resiliency mechanisms such as RAID that assume independent failures. These results also motivate development of better resiliency mechanisms that can tolerate multiple correlated failures and bursty failure behaviors.
- Storage subsystems configured with two independent interconnects experienced much (30-40%) lower annualized failure rates (AFRs) than those with a single interconnect.
This result indicates the importance of interconnect redundancy in the design of reliable storage systems.
- RAID groups built with disks spanning multiple shelf enclosures show much less bursty failure patterns than those built with disks from the same shelf enclosure.
This indicates that the former is a more resilient solution for large storage systems.
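To put the independence point in concrete terms, here is a rough back-of-the-envelope sketch in Python. Everything in it is my own illustration rather than anything from the report: the function name, the 3% disk AFR, the 24-hour rebuild window, the 8-disk RAID group and the 10x hazard multiplier are all assumed values, and the constant-failure-rate (exponential) model is a simplification. It compares the chance of losing another disk during a rebuild when failures are treated as independent against the chance when a recent failure temporarily raises the failure rate of the remaining disks.

```python
import math

HOURS_PER_YEAR = 24 * 365


def prob_additional_failure_during_rebuild(
    afr, remaining_disks, rebuild_hours, hazard_multiplier=1.0
):
    """Chance that at least one remaining disk fails during the rebuild window.

    afr               -- annualized failure rate of one disk, e.g. 0.03 (assumed)
    remaining_disks   -- disks left in the RAID group after the first failure
    rebuild_hours     -- length of the rebuild window in hours (assumed)
    hazard_multiplier -- hypothetical factor by which a recent failure raises
                         the failure rate of the survivors; 1.0 means the
                         failures are treated as fully independent
    """
    # Convert the AFR into a constant hourly failure rate (exponential model).
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    rate_per_hour *= hazard_multiplier

    # Probability that a single surviving disk fails within the window.
    p_one = 1.0 - math.exp(-rate_per_hour * rebuild_hours)

    # Probability that at least one of the survivors fails.
    return 1.0 - (1.0 - p_one) ** remaining_disks


if __name__ == "__main__":
    # Illustrative numbers only: 8-disk group, one disk already failed.
    independent = prob_additional_failure_during_rebuild(0.03, 7, 24)
    correlated = prob_additional_failure_during_rebuild(
        0.03, 7, 24, hazard_multiplier=10.0
    )
    print(f"independent assumption: {independent:.4%}")
    print(f"correlated (10x, assumed): {correlated:.4%}")
```

The absolute numbers mean little since the inputs are made up, but the comparison shows why the study's correlated, bursty failures matter: any sizing of RAID protection that assumes independent failures will understate the risk of a second failure while the array is already degraded.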
So, while this may be common knowledge for most storage folks, remember that RAID isn't a panacea and that attention should be given to the other components in your storage subsystems when considering redundancy.






