
Mar 7
Storage Failures: Are Disks Really to Blame?

When it comes to hardware failures in PCs, servers or storage arrays, I've always worked on the assumption that any component with moving parts is the one most likely to fail, and that the faster it moves, the shorter its mean time between failures.  That said, it's no revelation that hard drives spinning at 5400, 7200 or 10000 RPM are going to fail, which is why RAID and global hot spare drives should always be used to respond to a failure automatically and prevent the loss of data.

Increasing the availability of a storage array usually means adding other redundant components such as controllers, fans, physical data paths and power supplies, but the disks were always assumed to be the part most prone to failure.  Well, I just finished reading an article in The Register titled "You don't know disk about storage failures" that discusses a report compiled by the University of Illinois Department of Computer Science and Network Appliance.  The report shows that while disk failures account for a large share (20-55%) of storage subsystem failures, other hardware and software components should not be ignored when purchasing a storage array.

The report, titled "Are Disks the Dominant Contributor for Storage Failures?  A Comprehensive Study of Storage Subsystem Failure Characteristics", can be viewed in its entirety here, but I'll list its main findings and their implications:

  • In addition to disk failures that contribute to 20-55% of storage subsystem failures, other components such as physical interconnects (including shelf enclosures) and protocol stacks also account for significant percentages (27-68% and 5-10%, respectively) of failures.  Due to these component failures, even though storage systems of certain types (e.g., low-end primary systems) use more reliable disks than some other types (e.g., near-line backup systems), their storage subsystems exhibit higher failure rates.

These results indicate that, to build highly reliable and available storage systems, only using resiliency mechanisms targeting disk failures (e.g., RAID) is not enough.  We also need to build resiliency mechanisms such as redundant physical interconnects and self-checking protocol stacks to tolerate failures in these storage components.

  • Each individual storage subsystem failure type and storage subsystem failure as a whole exhibit strong correlations, (i.e. after one failure, the probability of additional failures of the same type is higher).  In addition, failures also exhibit bursty patterns in time distribution, (i.e. multiple failures of the same type tend to happen relatively close together).

These results motivate a revisiting of current resiliency mechanisms such as RAID that assume independent failures.  These results also motivate development of better resiliency mechanisms that can tolerate multiple correlated failures and bursty failure behaviors.
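To get a feel for why the independence assumption matters, here is a rough back-of-the-envelope sketch in Python.  It is not from the report: the exponential failure model, the `hazard_multiplier` knob, and every number below (7 surviving disks, a 24-hour rebuild, a 1,000,000-hour MTBF, a 10x multiplier) are assumptions chosen purely for illustration.

```python
import math


def p_second_failure(remaining_disks, rebuild_hours, mtbf_hours,
                     hazard_multiplier=1.0):
    """Chance that at least one surviving disk fails during a rebuild.

    Assumes exponentially distributed lifetimes. hazard_multiplier is a
    hypothetical knob: 1.0 means failures are independent, larger values
    crudely model the elevated risk that follows a first failure.
    """
    rate_per_hour = hazard_multiplier * remaining_disks / mtbf_hours
    return 1.0 - math.exp(-rate_per_hour * rebuild_hours)


# Illustrative inputs only; none of these values come from the report.
independent = p_second_failure(7, 24, 1_000_000)
correlated = p_second_failure(7, 24, 1_000_000, hazard_multiplier=10.0)

print(f"independent model: {independent:.6f}")
print(f"correlated model:  {correlated:.6f}")
```

The multiplier is a crude stand-in for real correlation, but it makes the point: if the risk of a second failure goes up right after the first one, a RAID group sized for independent failures is less safe than the math on the box suggests.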

  • Storage subsystems configured with two independent interconnects experienced much (30-40%) lower annualized failure rates (AFRs) than those with a single interconnect.

This result indicates the importance of interconnect redundancy in the design of reliable storage systems.
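As a quick worked example of what a 30-40% lower AFR could mean in practice, here is a small Python sketch.  Only the 30-40% range comes from the report; the 4% baseline AFR and the fleet of 1,000 subsystems are made-up figures for illustration.

```python
def reduced_afr(baseline_afr, reduction):
    """Return the AFR after applying a fractional reduction (0.30 = 30%)."""
    return baseline_afr * (1.0 - reduction)


baseline = 0.04   # hypothetical single-interconnect AFR, not from the report
fleet_size = 1000  # hypothetical number of storage subsystems

best_case = reduced_afr(baseline, 0.40)   # 40% lower
worst_case = reduced_afr(baseline, 0.30)  # 30% lower

print(f"single interconnect: ~{baseline * fleet_size:.0f} failures/year")
print(f"dual interconnect:   ~{best_case * fleet_size:.0f} to "
      f"{worst_case * fleet_size:.0f} failures/year")
```

Plug in your own baseline and fleet size; the arithmetic is simple, but it helps when weighing the cost of a second interconnect.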

  • RAID groups built with disks spanning multiple shelf enclosures show much less bursty failure patterns than those built with disks from the same shelf enclosure.

This indicates that the former is a more resilient solution for large storage systems.
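One way to picture spanning shelf enclosures is to deal disks into RAID groups the way you would deal cards, taking one disk from each shelf in turn.  The Python sketch below is my own illustration, not how any particular array lays out its groups; the function name, the shelf names and the `(shelf, slot)` disk identifiers are all hypothetical.

```python
from itertools import zip_longest


def build_spanning_groups(shelves, group_size):
    """Split disks into RAID groups that draw from as many shelves as possible.

    shelves maps a shelf name to its list of disk identifiers. Disks are
    interleaved one shelf at a time, then cut into groups of group_size.
    The last group may be smaller if the disks do not divide evenly.
    """
    interleaved = [
        disk
        for row in zip_longest(*shelves.values())
        for disk in row
        if disk is not None  # shelves may hold different numbers of disks
    ]
    return [
        interleaved[i:i + group_size]
        for i in range(0, len(interleaved), group_size)
    ]


# Hypothetical layout: 4 shelves with 4 disks each.
shelves = {
    f"shelf{s}": [(f"shelf{s}", slot) for slot in range(4)]
    for s in range(4)
}
groups = build_spanning_groups(shelves, group_size=4)

for number, group in enumerate(groups):
    print(number, group)
```

With this layout each four-disk group ends up with exactly one disk per shelf, so a problem with a single shelf enclosure touches only one member of any given group.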

So, while this may be common knowledge for most storage folks, remember that RAID isn't a panacea and that attention should be given to the other components in your storage subsystems when considering redundancy.
